Sparse Principal Component Analysis
with missing observations 



Abstract: In this paper, we study the problem of sparse Principal Component
Analysis (PCA) in the high-dimensional setting with missing observations. Our
goal is to estimate the first principal component when we only have access to
partial observations. Existing estimation techniques are usually derived for
fully observed data sets and require prior knowledge of the sparsity of the
first principal component in order to achieve good statistical guarantees. Our
contribution is threefold. First, we establish the first information-theoretic
lower bound for the sparse PCA problem with missing observations. Second, we
propose a simple procedure that requires neither prior knowledge of the sparsity
of the unknown first principal component nor any imputation of the missing
observations, adapts to the unknown sparsity of the first principal component,
and achieves the optimal rate of estimation up to a logarithmic factor. Third,
if the covariance matrix of interest admits a sparse first principal component
and is in addition approximately low-rank, then we can derive a completely
data-driven procedure that is computationally tractable in high dimension,
adaptive to the unknown sparsity of the first principal component, and
statistically optimal (up to a logarithmic factor).

AMS 2000 subject classifications: Primary 62H12. 
Keywords and phrases: Low-rank covariance matrix, Sparse Principal Component
Analysis, Missing observations, Information-theoretic lower bounds, Oracle
inequalities.



1. Introduction 

Let $X, X_1, \dots, X_n \in \mathbb{R}^p$ be i.i.d. zero mean vectors with unknown covariance
matrix $\Sigma = \mathbb{E}\, X \otimes X$ of the form

$$\Sigma = \sigma_1 \theta_1 \theta_1^\top + \sigma_2 \Upsilon, \tag{1.1}$$

where $\sigma_1 > \sigma_2 > 0$, $\theta_1 \in \mathcal{S}^p$ (the $l_2$ unit sphere in $\mathbb{R}^p$) and $\Upsilon$ is a $p \times p$
symmetric positive semi-definite matrix with spectral norm $\|\Upsilon\|_\infty \le 1$ and such
that $\Upsilon\theta_1 = 0$. The eigenvector $\theta_1$ is called the first principal component of $\Sigma$.
Our objective is to estimate the first principal component $\theta_1$ when the vectors
$X_1, \dots, X_n$ are partially observed. More precisely, we consider the following
framework. Denote by $X_i^{(j)}$ the j-th component of the vector $X_i$. We assume that
each component $X_i^{(j)}$ is observed independently of the others with probability
$\delta \in (0,1]$. Note that $\delta$ can be easily estimated by the proportion of observed
entries. Therefore, we will assume in this paper that $\delta$ is known. Note also
that the case $\delta = 1$ corresponds to the standard case of fully observed vectors.
Let $(\delta_{i,j})_{1 \le i \le n,\, 1 \le j \le p}$ be a sequence of i.i.d. Bernoulli random variables with
parameter $\delta$ and independent from $X_1, \dots, X_n$. We observe n i.i.d. random
vectors $Y_1, \dots, Y_n \in \mathbb{R}^p$ whose components satisfy

$$Y_i^{(j)} = \delta_{i,j} X_i^{(j)}, \quad 1 \le i \le n,\ 1 \le j \le p. \tag{1.2}$$

We can think of the $\delta_{i,j}$ as masked variables. If $\delta_{i,j} = 0$, then we cannot observe
the j-th component of $X_i$ and the default value 0 is assigned to $Y_i^{(j)}$. Our goal
is then to estimate $\theta_1$ given the partial observations $Y_1, \dots, Y_n$.
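
The observation model (1.2) can be simulated in a few lines. The following is a minimal sketch, assuming NumPy is available; the function names `mask_observations` and `estimate_delta` and the Gaussian toy data are illustrative choices, not part of the paper. It zero-fills unobserved entries and estimates $\delta$ by the proportion of observed entries, as suggested above.

```python
import numpy as np

def mask_observations(X, delta, rng):
    """Return (Y, mask) with Y[i, j] = delta_ij * X[i, j].

    The delta_ij are i.i.d. Bernoulli(delta); unobserved entries are
    assigned the default value 0, as in model (1.2).
    """
    mask = rng.random(X.shape) < delta      # True where the entry is observed
    return np.where(mask, X, 0.0), mask

def estimate_delta(mask):
    """Proportion of observed entries, a natural estimate of delta."""
    return float(mask.mean())

# Toy data (hypothetical): n = 2000 standard Gaussian vectors in dimension p = 50.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))
Y, mask = mask_observations(X, 0.9, rng)
delta_hat = estimate_delta(mask)
```

In practice the mask is given by the data set rather than simulated; only `estimate_delta` would then be needed.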

Principal Component Analysis (PCA) is a popular technique to reduce the
dimension of a data set that has been used for many years in a variety of
fields including image processing, engineering, genetics, meteorology, chemistry
and many others. In most of these fields, data are now high-dimensional, that is,
the number of parameters p is much larger than the sample size n, and contain
missing observations. This is especially true in genomics with gene expression
microarray data where PCA is used to detect the genes responsible for a given
biological process. Indeed, despite the recent improvements in gene expression
techniques, microarray data can contain up to 10% missing observations affecting
up to 95% of the genes. Unfortunately, it is a known fact that PCA is very
sensitive even to small perturbations of the data, including in particular missing
observations. Therefore, several strategies have been developed to deal with
missing values. The simple strategy that consists in eliminating from the PCA
study any gene with at least one missing observation is not acceptable in this
context since up to 95% of the genes can be eliminated from the study. An
alternative strategy consists in inferring the missing values prior to the PCA
using complex imputation schemes [2, 5]. These schemes usually assume that
the gene interactions follow some specified model and involve intensive
computational preprocessing to impute the missing observations. We propose in this
paper a different strategy. Instead of building an imputation technique based
on assumptions describing the genome structure (about which we usually have
no prior information), we propose a technique based on the analysis of the
perturbation process. In other words, if we understand the process generating the
missing observations, then we can efficiently correct the data prior to the PCA
analysis. This strategy was first introduced in [7] to estimate the spectrum of
low-rank covariance matrices. One of our goals is to show that this approach
can be successfully applied to perform fast and accurate PCA with missing
observations.

Standard PCA in the full observation framework ($\delta = 1$) consists in extracting
the first principal component of $\Sigma$ (that is, the eigenvector $\theta_1$ associated with the
largest eigenvalue) based on the i.i.d. observations $X_1, \dots, X_n$:

$$\hat\theta = \mathrm{argmax}_{\theta^\top\theta = 1}\ \theta^\top \Sigma_n \theta, \tag{1.3}$$

where $\Sigma_n = \frac{1}{n}\sum_{i=1}^n X_i X_i^\top$. Standard PCA presents two major drawbacks.
First, it is not consistent in high dimension [3, 10, 11]. Second, the solution $\hat\theta$ is
usually a dense vector whereas sparse solutions are preferred in most applications
in order to obtain simple interpretable structures. For instance, in microarray
data, we typically observe that only a few among the thousands of screened genes
are involved in a given biological process. In order to improve interpretability,
several approaches have been proposed to perform sparse PCA, that is, to enforce
sparsity of the PCA outcome. See for instance [12, 13, 17] for SVD based iterative
thresholding approaches. [18] reformulated the sparse PCA problem as a sparse
regression problem and then used the LASSO estimator. See also [9] for greedy
methods. We consider now the approach by [4] which consists in computing a
solution of (1.3) under the additional $l_1$-norm constraint $|\theta|_1 \le s$ for some fixed
integer $s \ge 1$ in order to enforce sparsity of the solution. The same approach with
the $l_1$-norm constraint replaced by the $l_0$-norm gives the following procedure

$$\hat\theta = \mathrm{argmax}_{\theta \in \mathcal{S}^p :\, |\theta|_0 \le s}\ \left(\theta^\top \Sigma_n \theta\right), \tag{1.4}$$

where $|\theta|_0$ denotes the number of nonzero components of $\theta$. In a recent paper,
[16] established the following oracle inequality

$$\left(\mathbb{E}\,\|\hat\theta\hat\theta^\top - \theta_1\theta_1^\top\|_2\right)^2 \le C \left(\frac{\sigma_1}{\sigma_1 - \sigma_2}\right)^2 \frac{s \log(ep/s)}{n},$$

for some absolute constant $C > 0$. Note that this procedure requires the knowledge
of an upper bound $s \ge |\theta_1|_0$. In practice, we generally do not have access
to any prior information on the sparsity of $\theta_1$. Consequently, if the parameter
s we use in the procedure is too small, then the above upper bound does not
hold, and if s is too large, then the above upper bound (even though valid) is
sub-optimal. In other words, the procedure (1.4) with $s = |\theta_1|_0$ can be seen as
an oracle and our goal is to propose a procedure that performs as well as this
oracle without any prior information on $|\theta_1|_0$.
In order to circumvent the fact that $|\theta_1|_0$ is unknown, we consider the following
procedure proposed by [1]

$$\hat\theta_1 = \mathrm{argmax}_{\theta \in \mathcal{S}^p}\ \left(\theta^\top \Sigma_n \theta - \lambda |\theta|_0\right), \tag{1.5}$$

where $\lambda > 0$ is a regularization parameter to be tuned properly. [1, 6] studied
the computational aspect. In particular, [6] proposed a computationally
tractable procedure to solve the above constrained maximization problem even
in high dimension. However, none of these references investigated the statistical
performance of this procedure or the question of the optimal tuning of $\lambda$. We
propose to carry out this analysis in this paper. We establish the optimality of
this procedure in the minimax sense and make explicit the optimal theoretical choice
for the regularization parameter $\lambda$.
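
To make the $l_0$-penalized criterion (1.5) concrete, here is a minimal brute-force sketch, assuming NumPy. It is not the tractable method of [6]: it enumerates every support, so it is only usable for very small p. It relies on the fact that, for a fixed support J, the best unit vector supported on J is the top eigenvector of the corresponding principal submatrix. The function name `l0_penalized_pca` and the toy matrix are illustrative assumptions.

```python
import itertools
import numpy as np

def l0_penalized_pca(S, lam):
    """Exhaustive maximizer of theta' S theta - lam * |theta|_0 over unit vectors.

    For each support J, the best value is lambda_max(S[J, J]) - lam * |J|.
    Cost is exponential in p: illustration only.
    """
    p = S.shape[0]
    best_val, best_theta = -np.inf, None
    for k in range(1, p + 1):
        for J in itertools.combinations(range(p), k):
            idx = np.array(J)
            w, V = np.linalg.eigh(S[np.ix_(idx, idx)])  # eigenvalues in ascending order
            val = w[-1] - lam * k
            if val > best_val:
                theta = np.zeros(p)
                theta[idx] = V[:, -1]                   # top eigenvector on J
                best_val, best_theta = val, theta
    return best_theta, best_val

# Toy matrix (hypothetical): 2-sparse leading direction plus isotropic noise.
theta_1 = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
S = 4.0 * np.outer(theta_1, theta_1) + 0.5 * np.eye(5)
theta_hat, val = l0_penalized_pca(S, lam=0.5)
```

On this toy matrix the penalty makes the two-element support optimal: adding a third coordinate does not increase the top eigenvalue but costs an extra $\lambda$.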
When the data contain incomplete observations ($\delta < 1$), we do not have
access to the empirical covariance matrix $\Sigma_n$. Given the observations $Y_1, \dots, Y_n$,
we can build the following empirical covariance matrix

$$\Sigma_n^{(\delta)} = \frac{1}{n}\sum_{i=1}^n Y_i Y_i^\top.$$

As noted in [7], $\Sigma_n^{(\delta)}$ is not an unbiased estimator of $\Sigma$. Consequently, we need
to consider the following correction in order to get sharp estimation results:

$$\tilde\Sigma_n = (\delta^{-1} - \delta^{-2})\,\mathrm{diag}\big(\Sigma_n^{(\delta)}\big) + \delta^{-2}\,\Sigma_n^{(\delta)}. \tag{1.6}$$

Indeed, we can check by elementary algebra that $\tilde\Sigma_n$ is an unbiased estimator
of $\Sigma$ in the missing observation framework $\delta \in (0, 1]$. Therefore, we consider the
following estimator in the missing observation framework

$$\hat\theta_1 = \mathrm{argmax}_{\theta \in \mathcal{S}^p :\, |\theta|_0 \le \bar s}\ \left(\theta^\top \tilde\Sigma_n \theta - \lambda |\theta|_0\right), \tag{1.7}$$

where $\lambda > 0$ is a regularization parameter to be tuned properly and $\bar s$ is a mild
constraint on $|\theta_1|_0$. More precisely, $\bar s$ can be chosen as large as $\delta^2 n / \log(ep)$ when
no prior information on $|\theta_1|_0$ is available. We will prove in particular that the
procedure (1.7) adapts to the unknown sparsity of $\theta_1$ provided that $|\theta_1|_0 \le \bar s$.
We also investigate the case where $\Sigma$ is in addition approximately low-rank.
In that case, we can remove the restriction $|\theta|_0 \le \bar s$ (taking $\bar s = p$) in the
procedure (1.7) and propose a data-driven choice of the regularization parameter
$\lambda$. We will show that this data-driven procedure also achieves the optimal rate
of estimation (up to a logarithmic factor) in the missing observation framework
$\delta \in (0, 1]$ without any prior knowledge of $|\theta_1|_0$. Finally, we establish information
theoretic lower bounds for the sparse PCA problem in the missing observation
framework $\delta \in (0, 1]$ with the sharp dependence on $\delta$, thus making fully explicit
the effect of missing observations on the sparse PCA estimation rate. Note that
our results are nonasymptotic in nature and hold for any setting of n, p including
in particular the high-dimensional setting $p > n$.
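
The bias correction (1.6) can be sketched as follows, assuming NumPy and zero-filled observations stored row-wise. The "elementary algebra" behind unbiasedness is that, under model (1.2), the expectation of $Y Y^\top$ equals $\delta^2 \Sigma$ off the diagonal and $\delta \Sigma$ on the diagonal; the map below undoes exactly this distortion. The names `debias` and `corrected_covariance` are illustrative, not from the paper.

```python
import numpy as np

def debias(S_delta, delta):
    """Apply the correction (1.6) to a p x p matrix S_delta."""
    D = np.diag(np.diag(S_delta))
    return (1.0 / delta - 1.0 / delta**2) * D + S_delta / delta**2

def corrected_covariance(Y, delta):
    """Corrected covariance estimator from zero-filled observations Y (n x p)."""
    n = Y.shape[0]
    S_delta = Y.T @ Y / n                   # empirical covariance of the masked data
    return debias(S_delta, delta)

# Population-level check on a toy covariance (hypothetical values):
# E[Y Y^T] = delta^2 * Sigma + (delta - delta^2) * diag(Sigma).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T
delta = 0.7
expected_YYt = delta**2 * Sigma + (delta - delta**2) * np.diag(np.diag(Sigma))
recovered = debias(expected_YYt, delta)     # should give back Sigma
```

Note that the corrected matrix is symmetric but need not be positive semi-definite for finite n, which is one reason the analysis works with quadratic forms rather than with a factorization of the estimator.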

The rest of the paper is organized as follows. In Section 2, we recall some 
tools and definitions that will be useful for our statistical analysis. Section 3 
contains our main theoretical results. Finally, Section 4 contains the proofs of 
our results. 



2. Tools and definitions 

In this section, we introduce various notations and definitions and we recall 
some known results that we will use to establish our results. 

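The $l_q$-norms of a vector $x = (x^{(1)}, \dots, x^{(p)})^\top \in \mathbb{R}^p$ are given by

$$|x|_q = \Big(\sum_{j=1}^p |x^{(j)}|^q\Big)^{1/q} \ \text{ for } 1 \le q < \infty, \quad \text{and} \quad |x|_\infty = \max_{1 \le j \le p} |x^{(j)}|.$$

The support of a vector $x = (x^{(1)}, \dots, x^{(p)})^\top \in \mathbb{R}^p$ is defined as follows

$$J(x) = \{j : x^{(j)} \ne 0\}.$$

We denote the number of nonzero components of x by $|x|_0$. Note that $|x|_0 = |J(x)|$.
Set $\mathcal{S}^p = \{x \in \mathbb{R}^p : |x|_2 = 1\}$. For any $J \subset [p]$, we define
$\mathcal{S}^p(J) = \{x \in \mathcal{S}^p : J(x) = J\}$. For any integer $1 \le s \le p$, we define
$\mathcal{S}_s^p = \{x \in \mathcal{S}^p : |x|_0 = s\}$. Note that $\mathcal{S}_s^p = \bigcup_{J \subset [p] :\, |J| = s} \mathcal{S}^p(J)$.

For any $p \times p$ symmetric matrix A with eigenvalues $\sigma_1(A), \dots, \sigma_p(A)$, we
define the Schatten q-norm of A by

$$\|A\|_q = \Big(\sum_{j=1}^p |\sigma_j(A)|^q\Big)^{1/q}, \ \forall\, 1 \le q < \infty, \quad \text{and} \quad \|A\|_\infty = \max_{1 \le j \le p} \{|\sigma_j(A)|\}.$$

Define the usual matrix scalar product $\langle A, B \rangle = \mathrm{tr}(A^\top B)$ for any $A, B \in \mathbb{R}^{p \times p}$.
Note that $\|A\|_2 = \sqrt{\langle A, A \rangle}$ for any $A \in \mathbb{R}^{p \times p}$. Recall the trace duality property

$$|\langle A, B \rangle| \le \|A\|_\infty \|B\|_1, \quad \forall A, B \in \mathbb{R}^{p \times p}.$$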
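
As a small numerical illustration of the Schatten norms and of trace duality, here is a hedged sketch assuming NumPy and symmetric inputs; the helper name `schatten_norm` and the two test matrices are illustrative.

```python
import numpy as np

def schatten_norm(A, q):
    """Schatten q-norm of a symmetric matrix A (q >= 1, or np.inf)."""
    ev = np.abs(np.linalg.eigvalsh(A))      # absolute eigenvalues of A
    if np.isinf(q):
        return float(ev.max())              # spectral norm
    return float((ev ** q).sum() ** (1.0 / q))

A = np.diag([2.0, -1.0])
B = np.array([[1.0, 2.0], [2.0, 3.0]])
inner = float(np.trace(A.T @ B))            # matrix scalar product <A, B>
duality_gap = schatten_norm(A, np.inf) * schatten_norm(B, 1) - abs(inner)
```

Here `duality_gap` is nonnegative, in agreement with the trace duality inequality.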

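We recall now some basic facts about $\epsilon$-nets (see for instance Section 5.2.2
in [15]).

Definition 1. Let $(A, d)$ be a metric space and let $\epsilon > 0$. A subset $\mathcal{N}_\epsilon$ of A is
called an $\epsilon$-net of A if for every point $a \in A$, there exists a point $b \in \mathcal{N}_\epsilon$ so that
$d(a, b) \le \epsilon$.

We recall now an approximation result of the spectral norm on an $\epsilon$-net.

Lemma 1. Let A be a $k \times k$ symmetric matrix for some $k \ge 1$. For any
$\epsilon \in (0, 1/2)$, there exists an $\epsilon$-net $\mathcal{N}_\epsilon \subset \mathcal{S}^k$ (the unit sphere in $\mathbb{R}^k$) such that

$$|\mathcal{N}_\epsilon| \le \Big(1 + \frac{2}{\epsilon}\Big)^k,$$

and

$$\sup_{x \in \mathcal{S}^k} |\langle Ax, x \rangle| \le \frac{1}{1 - 2\epsilon} \sup_{x \in \mathcal{N}_\epsilon} |\langle Ax, x \rangle|.$$

See for instance Lemma 5.2 and Lemma 5.3 in [15] for a proof.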
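
As a worked instance, and assuming the two bounds of Lemma 1 read $(1 + 2/\epsilon)^k$ and $(1 - 2\epsilon)^{-1}$ as in the standard covering argument of [15], the choice $\epsilon = 1/4$ yields the constants that appear in the proof of Proposition 2, where a net of cardinality at most $9^s$ is used:

```latex
\[
|\mathcal{N}_{1/4}| \;\le\; \Bigl(1 + \frac{2}{1/4}\Bigr)^{k} = 9^{k},
\qquad
\sup_{x \in \mathcal{S}^k} |\langle Ax, x\rangle|
\;\le\; \frac{1}{1 - 2\cdot\frac{1}{4}} \sup_{x \in \mathcal{N}_{1/4}} |\langle Ax, x\rangle|
\;=\; 2 \sup_{x \in \mathcal{N}_{1/4}} |\langle Ax, x\rangle| .
\]
```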
We recall now the definition and some basic properties of sub-exponential
random vectors.

Definition 2. The $\psi_\alpha$-norms of a real-valued random variable V are defined by

$$\|V\|_{\psi_\alpha} = \inf\left\{u > 0 : \mathbb{E}\exp\left(|V|^\alpha / u^\alpha\right) \le 2\right\}, \quad \alpha \ge 1.$$

We say that a random variable V with values in $\mathbb{R}$ is sub-exponential if $\|V\|_{\psi_\alpha} < \infty$
for some $\alpha \ge 1$. If $\alpha = 2$, we say that V is sub-gaussian.

We recall some well-known properties of sub-exponential random variables:

1. For any real-valued random variable V such that $\|V\|_{\psi_\alpha} < \infty$ for some
$\alpha \ge 1$, we have

$$\mathbb{E}|V|^m \le 2\,\frac{m}{\alpha}\,\Gamma\Big(\frac{m}{\alpha}\Big)\,\|V\|_{\psi_\alpha}^m, \quad \forall m \ge 1, \tag{2.1}$$

where $\Gamma(\cdot)$ is the Gamma function.

2. If a real-valued random variable V is sub-gaussian, then $V^2$ is sub-exponential.
Indeed, we have

$$\|V^2\|_{\psi_1} \le 2\|V\|_{\psi_2}^2. \tag{2.2}$$

Definition 3. A random vector $X \in \mathbb{R}^p$ is sub-exponential if $\langle X, x \rangle$ are
sub-exponential random variables for all $x \in \mathbb{R}^p$. The $\psi_\alpha$-norms of a random vector
X are defined by

$$\|X\|_{\psi_\alpha} = \sup_{x \in \mathcal{S}^p} \|\langle X, x \rangle\|_{\psi_\alpha}, \quad \alpha \ge 1.$$

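We recall a version of Bernstein's inequality for unbounded real-valued random
variables.

Proposition 1. Let $Y_1, \dots, Y_n$ be independent real-valued random variables with
zero mean. Let there exist constants $\sigma$, $\sigma'$ and K such that for any $m \ge 2$

$$\frac{1}{n}\sum_{i=1}^n \left|\mathbb{E}[Y_i^m]\right| \le \frac{m!}{2} K^{m-2} \sigma^2, \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[|Y_i|^m\right] \le \frac{m!}{2} K^{m-2} (\sigma')^2. \tag{2.3}$$

Then for every $t > 0$, we have with probability at least $1 - 2e^{-t}$

$$\Big|\frac{1}{n}\sum_{i=1}^n Y_i\Big| \le \sigma\sqrt{\frac{2t}{n}} + K\frac{t}{n}.$$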

Remark 1. In the usual formulation of Bernstein's inequality, only the second
moment condition in (2.3) is imposed and the conclusion remains valid with
$\sigma$ replaced by $\sigma'$. The refinement we propose here is necessary in the missing
observation framework in order to get the sharp dependence of our bounds on
$\delta$. An inspection of the proof of Bernstein's inequality shows that this
refinement follows immediately from Chernoff's bound used to prove the standard
Bernstein's inequality (see for instance Proposition 2.9 in [8]). Indeed, using
Chernoff's approach, we need a control on the following expectation

$$\mathbb{E}[\Phi(tY_i)] = \frac{t^2\,\mathbb{E}[Y_i^2]}{2} + \mathbb{E}\Big[\sum_{k=3}^\infty \frac{t^k Y_i^k}{k!}\Big] = \frac{t^2\,\mathbb{E}[Y_i^2]}{2} + \sum_{k=3}^\infty \frac{t^k\,\mathbb{E}[Y_i^k]}{k!},$$

where we have used the second moment condition in (2.3) and Fubini's theorem
to justify the inversion of the sum and the expectation. The rest of the proof is
left unchanged.
3. Main results for sparse PCA with missing observations

In this section, we state our main statistical results concerning the procedure
(1.7). We will establish these results under the following condition on the
distribution of X.

Assumption 1 (Sub-gaussian observations). The random vector $X \in \mathbb{R}^p$ is
sub-gaussian, that is $\|X\|_{\psi_2} < \infty$. In addition, there exists a numerical constant
$c_1 > 0$ such that

$$\mathbb{E}\left(\langle X, u \rangle\right)^2 \ge c_1 \|\langle X, u \rangle\|_{\psi_2}^2, \quad \forall u \in \mathbb{R}^p. \tag{3.1}$$



3.1. Oracle inequalities for sparse PCA 

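We first establish a preliminary result on the stochastic deviation of the following
empirical process

$$Z_n(s) = \max_{\theta \in \mathcal{S}_s^p} \left|\theta^\top(\tilde\Sigma_n - \Sigma)\theta\right|, \quad \forall\, 1 \le s \le p.$$

To this end, we introduce the following quantity

$$\zeta_n(s, p, t, \delta) := \max\left\{\sqrt{\frac{t + s\log(ep/s)}{\delta^2 n}},\ \frac{t + s\log(ep/s)}{\delta^2 n}\right\}.$$

Proposition 2. Let Assumption 1 be satisfied. Let $Y_1, \dots, Y_n$ be defined in
(1.2) with $\delta \in (0, 1]$. Then, for any $t > 0$, we have

$$\mathbb{P}\left(\bigcap_{s=1}^p \left\{Z_n(s) \le c\,\frac{\sigma_{\max}(s)}{c_1 \wedge 1}\,\zeta_n(s, p, t, \delta)\right\}\right) \ge 1 - e^{-t}, \tag{3.2}$$

where $c > 0$ is an absolute constant and $\sigma_{\max}(s) = \max_{\theta \in \mathcal{S}_s^p}(\theta^\top \Sigma \theta)$.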
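
The deviation term can be evaluated numerically. The sketch below assumes the quantity reads $\zeta_n(s,p,t,\delta) = \max\{\sqrt{x}, x\}$ with $x = (t + s\log(ep/s))/(\delta^2 n)$; the function name `zeta_n` is an illustrative choice. It shows the two regimes: the square-root term dominates when $x \le 1$ (enough samples), and the linear term dominates otherwise.

```python
import math

def zeta_n(s, p, t, delta, n):
    """max(sqrt(x), x) with x = (t + s*log(e*p/s)) / (delta^2 * n)."""
    if not (1 <= s <= p) or not (0.0 < delta <= 1.0) or n <= 0:
        raise ValueError("need 1 <= s <= p, 0 < delta <= 1 and n > 0")
    x = (t + s * math.log(math.e * p / s)) / (delta ** 2 * n)
    return max(math.sqrt(x), x)

# With s = p = 1 and t = 0 the numerator equals log(e) = 1, so x = 1 / (delta^2 * n).
small_x = zeta_n(1, 1, 0.0, 1.0, 4)     # x = 0.25, square-root regime
large_x = zeta_n(1, 1, 0.0, 0.5, 1)     # x = 4, linear regime
```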
We can now state our main result 

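Theorem 1. Let Assumption 1 be satisfied. Let $Y_1, \dots, Y_n$ be defined in (1.2)
with $\delta \in (0, 1]$. Consider the estimator (1.7) with parameters $\bar s$ satisfying
$n \ge \delta^{-2}(\bar s + 1)\log(ep/\bar s)$ and

$$\lambda = C\,\frac{\sigma_1^2}{\sigma_1 - \sigma_2}\,\frac{\log(ep)}{\delta^2 n}, \tag{3.3}$$

where $C > 0$ is a large enough numerical constant.

If $|\theta_1|_0 \le \bar s$, then we have, with probability at least $1 - \frac{1}{p}$, that

$$\|\hat\theta_1\hat\theta_1^\top - \theta_1\theta_1^\top\|_2^2 \le C'\,|\theta_1|_0\,\bar\sigma^2\,\frac{\log(ep)}{\delta^2 n},$$

where $\bar\sigma = \frac{\sigma_1}{\sigma_1 - \sigma_2}$ and $C' > 0$ is a numerical constant that can depend only on
$c_1$.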
1. We observe that the estimation bound increases as the difference $\sigma_1 - \sigma_2$
decreases. The problem of estimation of the first principal component is
statistically more difficult when the largest and second largest eigenvalues
are close. We also observe that the optimal choice of the regularization
parameter (3.3) depends on the eigenvalues $\sigma_1, \sigma_2$ of $\Sigma$. Unfortunately,
these quantities are typically unknown in practice. In order to circumvent
this difficulty, we propose in Section 3.3 a data-driven choice of $\lambda$ with
optimal statistical performance (up to a logarithmic factor) provided that
$\Sigma$ is approximately low-rank.

2. Let us now consider the full observation framework ($\delta = 1$). In that case, if
$|\theta_1|_0 \le \bar s$, we obtain the following upper bound with probability at least
$1 - \frac{1}{p}$:

$$\|\hat\theta_1\hat\theta_1^\top - \theta_1\theta_1^\top\|_2^2 \le C'\,|\theta_1|_0\,\bar\sigma^2\,\frac{\log(ep)}{n}.$$

We can compare this result with that obtained for the procedure (1.4) by
[16]:

$$\left(\mathbb{E}\,\|\hat\theta\hat\theta^\top - \theta_1\theta_1^\top\|_2\right)^2 \le C'\,s\,\bar\sigma^2\,\frac{\log(ep/s)}{n}.$$

We see that in order to achieve the rate $|\theta_1|_0\,\bar\sigma^2 \log(ep/|\theta_1|_0)/n$ with the
procedure (1.4), we need to know the sparsity of $\theta_1$ in advance, whereas
our procedure adapts to the unknown sparsity of $\theta_1$ and achieves the
minimax optimal rate up to a logarithmic factor provided that $|\theta_1|_0 \le \bar s$
(see Section 3.4 for the lower bounds). This logarithmic factor is the price
we pay for adaptation to the sparsity of $\theta_1$. Note also that we can formulate
a version of (1.4) when observations are missing ($\delta < 1$) by replacing $\Sigma_n$
with $\tilde\Sigma_n$. In that case, our techniques of proof will give, with probability
at least $1 - \frac{1}{p}$,

$$\|\hat\theta\hat\theta^\top - \theta_1\theta_1^\top\|_2^2 \le C\,\bar\sigma^2\,\frac{s\log(ep/s)}{\delta^2 n}.$$

3. We discuss now the choice of $\bar s$. In practice, when no prior information on
the sparsity of $\theta_1$ is available, we propose to choose $\bar s = \frac{\delta^2 n}{\log(ep)} - 1$. Then,
the procedure (1.7) adapts to the unknown sparsity of $\theta_1$ provided that
the condition $|\theta_1|_0 \le \frac{\delta^2 n}{\log(ep)} - 1$ is satisfied, which is actually a natural
condition on $\theta_1$ in order to obtain a nontrivial estimation result. Indeed, if
$|\theta_1|_0 \ge \delta^2 n / \log(ep)$, then the upper bound in Theorem 1 for the estimator
$\hat\theta_1$ becomes larger than $\bar\sigma^2 \ge 1$ whereas the bound for the null estimator
is $\|0 - \theta_1\theta_1^\top\|_2^2 = 1$.

4. In the case where observations are missing ($\delta < 1$), Theorem 1 guarantees
that recovery of the first principal component is still possible using the
procedure (1.7). We observe the additional factor $\delta^{-2}$. Consequently, the
estimation accuracy of the procedure (1.7) will decrease as the proportion
of observed entries $\delta$ decreases. We show in Section 3.4 below that the
dependence of our bounds on $\delta^{-2}$ is sharp. In other words, there exists no
statistical procedure that achieves an upper bound without the factor $\delta^{-2}$.
Thus, we can conclude that the factor $\delta^{-2}$ is the statistical price to pay
to deal with missing observations in the principal component estimation
problem. If we consider for instance microarray datasets where typically
about 10% of the observations are missing (that is $\delta = 0.9$), then the optimal
bound achieved for the first principal component estimation increases
by a factor $\delta^{-2} = 0.9^{-2} \approx 1.23$ as compared to the full observation framework ($\delta = 1$).



3.2. Study of approximately low-rank covariance matrices 

We now assume that $\Sigma$ defined in (1.1) is also approximately low-rank and
study the different implications of this additional condition. We recall that the
effective rank of $\Sigma$ is defined by $r(\Sigma) = \mathrm{tr}(\Sigma)/\|\Sigma\|_\infty$ where $\mathrm{tr}(\Sigma)$ is the trace
of $\Sigma$. We say that $\Sigma$ is approximately low-rank when $r(\Sigma) \ll p$. Note also that
the effective rank of a covariance matrix can be estimated efficiently by $r(\hat\Sigma)$
where $\hat\Sigma$ is an acceptable estimator of $\Sigma$. See [7] for more details. Thus, the
approximately low-rank assumption can easily be checked in practice.
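
The effective rank is straightforward to compute. The following is a minimal sketch, assuming NumPy and a symmetric input; the function name `effective_rank` is illustrative. In the missing observation framework one would apply it to an estimator of the covariance matrix such as the corrected estimator (1.6).

```python
import numpy as np

def effective_rank(S):
    """Effective rank tr(S) / ||S||_inf of a symmetric matrix S."""
    spectral_norm = np.linalg.norm(S, 2)    # largest singular value
    return float(np.trace(S) / spectral_norm)

v = np.array([1.0, 2.0, 2.0])
r_identity = effective_rank(np.eye(5))                  # full rank, flat spectrum
r_rank_one = effective_rank(np.outer(v, v))             # rank one
r_spiked = effective_rank(np.diag([4.0, 1.0, 1.0]))     # one dominant eigenvalue
```

A flat spectrum gives an effective rank equal to the dimension, while a single dominant eigenvalue pulls it towards 1.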

First, we can propose a different control of the stochastic quantities $Z_n(s)$.
Note indeed that $Z_n(s) \le \|\tilde\Sigma_n - \Sigma\|_\infty$ for any $1 \le s \le p$. We apply now
Proposition 3 in [7] and get the following control on $Z_n(s)$. Under the assumptions of
Proposition 2, we have with probability at least $1 - e^{-t}$ that

$$\|\tilde\Sigma_n - \Sigma\|_\infty \le C\,\frac{\sigma_1}{c_1}\,\max\left\{\sqrt{\frac{r(\Sigma)(t + \log(2p))}{\delta^2 n}},\ \frac{r(\Sigma)(t + \log(2p))}{\delta^2 n}\,(c_1\delta + t + \log n)\right\}, \tag{3.4}$$

where $C > 0$ is an absolute constant. We concentrate now on the high-dimensional
setting $p \ge n$. Assume that

$$n \ge c\,\frac{r(\Sigma)\log^2(ep)}{\delta^2} \tag{3.5}$$

for some sufficiently large numerical constant $c > 0$. Taking $t = \log(ep)$, we get
from the two above displays, with probability at least $1 - \frac{1}{ep}$, that

$$\|\tilde\Sigma_n - \Sigma\|_\infty \le c'\,\sigma_1\sqrt{\frac{r(\Sigma)\log(ep)}{\delta^2 n}},$$

where $c' > 0$ can depend only on $c_1$. Combining the previous display with
Proposition 2 and a union bound argument, we immediately obtain the following
control on $Z_n(s)$.

Proposition 3. Let the conditions of Proposition 2 be satisfied. In addition, let 
(3.5) be satisfied. Then we have 

p (n koo < caW^isMmJ^X) > i - \, 
where $c > 0$ is a numerical constant that can depend only on $c_1$.

The motivation behind this new bound is the following. When (3.5) is satisfied,
we can remove the restriction $|\theta|_0 \le \bar s$ in the procedure (1.7). We consider
now

$$\hat\theta_1 = \mathrm{argmax}_{\theta \in \mathcal{S}^p}\ \left(\theta^\top \tilde\Sigma_n \theta - \lambda |\theta|_0\right). \tag{3.6}$$

Then, a solution of this problem can be computed efficiently even in the
high-dimensional setting using a generalized power method (see [6] for more details
on the computational aspect), whereas it is not clear whether the same holds
true for the procedure (1.7) with the constraint $|\theta|_0 \le \bar s$.

We now consider the statistical performance of the procedure (3.6). Following
the proof of Theorem 1, we establish this result for $\hat\theta_1$.

Theorem 2. Let Assumption 1 be satisfied. Let $Y_1, \dots, Y_n$ be defined in (1.2)
with $\delta \in (0, 1]$. In addition, let (3.5) be satisfied. Take

$$\lambda = C\,\frac{\sigma_1^2}{\sigma_1 - \sigma_2}\,\frac{\log(ep)}{\delta^2 n}, \tag{3.7}$$

where $C > 0$ is a large enough numerical constant. Then we have, with probability
at least $1 - \frac{1}{p}$, that

$$\|\hat\theta_1\hat\theta_1^\top - \theta_1\theta_1^\top\|_2^2 \le C'\,|\theta_1|_0\,\bar\sigma^2\,\frac{\log(ep)}{\delta^2 n},$$

where $\bar\sigma = \frac{\sigma_1}{\sigma_1 - \sigma_2}$ and $C' > 0$ is a numerical constant that can depend only on
$c_1$.

Note that this result holds without any condition on the sparsity of $\theta_1$. Of
course, as we already commented for Theorem 1, the result is of statistical
interest only when $\theta_1$ is sparse: $|\theta_1|_0 \le \frac{\delta^2 n}{\bar\sigma^2 \log(ep)}$. The interest of this result is to
guarantee that the computationally tractable estimator (3.6) is also statistically
optimal.



3.3. Data-driven choice of $\lambda$

As we see in Theorem 2, the optimal choice of the regularization parameter
depends on the largest and second largest eigenvalues of $\Sigma$. These quantities
are typically unknown in practice. To circumvent this difficulty, we propose the
following data-driven choice for the regularization parameter $\lambda$

$$\lambda_D = C\,\frac{\hat\sigma_1^2}{\hat\sigma_1 - \hat\sigma_2}\,\frac{\log(ep)}{\delta^2 n}, \tag{3.8}$$

where $C > 0$ is a numerical constant and $\hat\sigma_1$ and $\hat\sigma_2$ are the two largest
eigenvalues of $\tilde\Sigma_n$. If (3.5) is satisfied, then as a consequence of Proposition 3 in [7],
$\hat\sigma_1$ and $\hat\sigma_2$ are good estimators of $\sigma_1$ and $\sigma_2$ even in the missing observation
framework. In order to guarantee that $\lambda_D$ is a suitable choice, we will need a
more restrictive condition on the number of measurements n than (3.5). This
new condition involves in addition the "variance" $\bar\sigma^2 = \frac{\sigma_1^2}{(\sigma_1 - \sigma_2)^2}$:

$$n \ge c\,\frac{\bar\sigma^2}{\delta^2}\,r(\Sigma)\log^2(ep), \tag{3.9}$$

where $c > 0$ is a sufficiently large numerical constant. As compared to (3.5),
we observe the additional factor $\bar\sigma^2$ in the above condition. We already noted
that matrices $\Sigma$ for which the difference $\sigma_1 - \sigma_2$ is small are statistically more
difficult to estimate. We observe that the number of measurements needed to
construct a suitable data-driven estimator also increases as the difference $\sigma_1 - \sigma_2$
decreases to 0.

We have the following result.

Lemma 1. Let the conditions of Proposition 2 be satisfied. Assume in addition
that (3.9) is satisfied. Let $\lambda_D$ be defined in (3.8). Then, we have with probability
at least $1 - \frac{1}{p}$ that

$$Z_n(s) \le \lambda_D\, s, \quad \forall\, 1 \le s \le p,$$

and

$$\lambda_D \le C'\,\frac{\sigma_1^2}{\sigma_1 - \sigma_2}\,\frac{\log(ep)}{\delta^2 n},$$

for some numerical constant $C' > 0$.

Consequently, the conclusion of Theorem 2 holds true for the estimator (3.6)
with $\lambda = \lambda_D$ provided that (3.9) is satisfied.
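
The data-driven tuning can be sketched as follows, assuming NumPy and assuming that (3.8) reads $\lambda_D = C\,\hat\sigma_1^2/(\hat\sigma_1 - \hat\sigma_2) \cdot \log(ep)/(\delta^2 n)$. The paper only states that C is a numerical constant, so the default `C=1.0` below is a placeholder, and the function name `data_driven_lambda` is illustrative.

```python
import math
import numpy as np

def data_driven_lambda(S_tilde, delta, n, C=1.0):
    """lambda_D computed from the two largest eigenvalues of S_tilde.

    C is an unspecified numerical constant in the text; 1.0 is a placeholder.
    """
    p = S_tilde.shape[0]
    ev = np.linalg.eigvalsh(S_tilde)        # eigenvalues in ascending order
    s1, s2 = ev[-1], ev[-2]
    if s1 <= s2:
        raise ValueError("the two largest eigenvalues must be distinct")
    return float(C * s1 ** 2 / (s1 - s2) * math.log(math.e * p) / (delta ** 2 * n))

# Toy input (hypothetical): eigenvalues 4, 1, 1, so s1^2 / (s1 - s2) = 16 / 3.
S_toy = np.diag([4.0, 1.0, 1.0])
lam_full = data_driven_lambda(S_toy, delta=1.0, n=100)
lam_half = data_driven_lambda(S_toy, delta=0.5, n=100)
```

Halving $\delta$ multiplies the penalty by four, which mirrors the $\delta^{-2}$ factor in the estimation bounds.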



3.4. Information theoretic lower bounds

We derive now minimax lower bounds for the estimation of the first principal
component $\theta_1$ in the missing observation framework.

Let $s_1 \ge 1$. We denote by $\mathcal{C} = \mathcal{C}_{s_1}(\sigma_1, \sigma_2)$ the class of covariance matrices
$\Sigma$ satisfying (1.1) with $\sigma_1 > \sigma_2$, $\theta_1 \in \mathcal{S}^p$ with $|\theta_1|_0 \le s_1$ and $\Upsilon$ a $p \times p$
symmetric positive semi-definite matrix with spectral norm $\|\Upsilon\|_\infty \le 1$ and such
that $\Upsilon\theta_1 = 0$. We prove now that the dependence of our estimation bounds
on $\sigma_1 - \sigma_2$, $\delta$, $s_1$, n, p in Theorems 1 and 2 is sharp in the minimax sense. Set



Theorem 3. Fix $\delta \in (0, 1]$ and $s_1 \ge 3$. Let the integers $n, p \ge 3$ satisfy

$$2\bar\sigma^2 s_1 \log(ep/s_1) \le \delta^2 n. \tag{3.10}$$

Let $X_1, \dots, X_n$ be i.i.d. random vectors in $\mathbb{R}^p$ with covariance matrix $\Sigma \in \mathcal{C}$.
We observe n i.i.d. random vectors $Y_1, \dots, Y_n \in \mathbb{R}^p$ such that

$$Y_i^{(j)} = \delta_{i,j} X_i^{(j)}, \quad 1 \le i \le n,\ 1 \le j \le p,$$

where $(\delta_{i,j})_{1 \le i \le n,\, 1 \le j \le p}$ is an i.i.d. sequence of Bernoulli $B(\delta)$ random variables
independent of $X_1, \dots, X_n$.

Then, there exist absolute constants $\beta \in (0, 1)$ and $c > 0$ such that

$$\inf_{\hat\theta_1} \sup_{\Sigma \in \mathcal{C}} \mathbb{P}_\Sigma\left(\|\hat\theta_1\hat\theta_1^\top - \theta_1\theta_1^\top\|_2^2 \ge c\,\bar\sigma^2\,\frac{s_1}{\delta^2 n}\,\log\Big(\frac{ep}{s_1}\Big)\right) \ge \beta, \tag{3.11}$$

where $\inf_{\hat\theta_1}$ denotes the infimum over all possible estimators $\hat\theta_1$ of $\theta_1$ based on
$Y_1, \dots, Y_n$.

Remark 2. For $s_1 = 1$, we can prove a similar lower bound with the factor
$\delta^{-2}$ replaced by $\delta^{-1}$. This is actually the right dependence on $\delta$ for 1-sparse
vectors. We can indeed derive an upper bound of the same order for the selector
$e_{\hat j}$ with $\hat j = \mathrm{argmax}_{1 \le j \le p}\left(e_j^\top \tilde\Sigma_n e_j\right)$, where $e_1, \dots, e_p$ are the canonical vectors of $\mathbb{R}^p$.

For $s_1 = 2$, we can prove a lower bound of the form (3.11) without the
logarithmic factor by comparing for instance the hypotheses $\theta_0 = \frac{1}{\sqrt{2}}(e_1 + e_2)$ and
$\theta_1 = \frac{1}{2}e_1 + \frac{\sqrt{3}}{2}e_2$. Getting a lower bound for $s_1 = 2$ with the logarithmic factor
remains an open question.

4. Proofs 

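4.1. Proof of Proposition 2

PROOF. For any $s \ge 1$, we have

$$Z_n(s) \le \delta^{-1} Z_n^{(1)}(s) + \delta^{-2} Z_n^{(2)}(s), \tag{4.1}$$

where

$$Z_n^{(1)}(s) = \max_{\theta \in \mathcal{S}_s^p}\left\{\left|\theta^\top \mathrm{diag}\big(\Sigma_n^{(\delta)} - \Sigma^{(\delta)}\big)\theta\right|\right\}, \quad Z_n^{(2)}(s) = \max_{\theta \in \mathcal{S}_s^p}\left\{\left|\theta^\top \big(A_n^{(\delta)} - A^{(\delta)}\big)\theta\right|\right\}$$

with $A_n^{(\delta)} = \Sigma_n^{(\delta)} - \mathrm{diag}(\Sigma_n^{(\delta)})$ and $A^{(\delta)} = \Sigma^{(\delta)} - \mathrm{diag}(\Sigma^{(\delta)})$.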

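Before we proceed with the study of the empirical processes $Z_n^{(1)}(s)$ and
$Z_n^{(2)}(s)$, we need to introduce first some additional notation. Define

$$Y = (\delta_1 X^{(1)}, \dots, \delta_p X^{(p)})^\top,$$

where $\delta_1, \dots, \delta_p$ are i.i.d. Bernoulli random variables with parameter $\delta$ and
independent from X. Denote by $\mathbb{E}_\delta$ and $\mathbb{E}_X$ the expectations w.r.t. $(\delta_1, \dots, \delta_p)$
and X respectively. We also note that for any $m \ge 2$ and any $\theta \in \mathcal{S}^p$, we have
$\mathbb{E}[(\theta^\top YY^\top\theta)^m] = \mathbb{E}_X\mathbb{E}_\delta[(\theta^\top YY^\top\theta)^m] = \mathbb{E}_\delta\mathbb{E}_X[(\theta^\top YY^\top\theta)^m]$. This comes
from Fubini's theorem and the fact that X is sub-gaussian.
We now proceed with the study of $Z_n^{(2)}(s)$. For any $s \ge 1$ and any fixed
$\theta \in \mathcal{S}_s^p$, we have

$$\theta^\top\big(A_n^{(\delta)} - A^{(\delta)}\big)\theta = \frac{1}{n}\sum_{i=1}^n \left[\theta^\top\big(Y_iY_i^\top - \mathrm{diag}(Y_iY_i^\top)\big)\theta - \delta^2\,\theta^\top(\Sigma - \mathrm{diag}(\Sigma))\theta\right].$$

Set

$$Z_i = \left[\big(Y_iY_i^\top - \mathrm{diag}(Y_iY_i^\top)\big) - \delta^2(\Sigma - \mathrm{diag}(\Sigma))\right]$$

and

$$Z = \left[\big(YY^\top - \mathrm{diag}(YY^\top)\big) - \delta^2(\Sigma - \mathrm{diag}(\Sigma))\right].$$

We note that $Z, Z_1, \dots, Z_n$ are i.i.d. We now study the moments $\mathbb{E}[(\theta^\top Z\theta)^m]$
and $\mathbb{E}[|\theta^\top Z\theta|^m]$ for any $\theta \in \mathcal{S}_s^p$ and $m \ge 2$.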
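Note first that

$$\left|\theta^\top(\Sigma - \mathrm{diag}(\Sigma))\theta\right| \le \max\left(\theta^\top\Sigma\theta,\ \theta^\top\mathrm{diag}(\Sigma)\theta\right) \le \sigma_{\max}(s), \quad \forall\theta \in \mathcal{S}_s^p,\ \forall s \ge 1,$$

where $\sigma_{\max}(s) = \max_{u \in \mathcal{S}_s^p}(u^\top\Sigma u)$. (We have indeed that
$\max_{u \in \mathcal{S}_s^p}(u^\top\mathrm{diag}(\Sigma)u) \le \sigma_{\max}(1) \le \sigma_{\max}(s)$.)

Thus we get, for any $m \ge 2$ and $\theta \in \mathcal{S}_s^p$, that

$$\mathbb{E}\left[(\theta^\top Z\theta)^m\right] \le 2^{m-1}\,\mathbb{E}\left[\big(\theta^\top(YY^\top - \mathrm{diag}(YY^\top))\theta\big)^m\right] + 2^{m-1}\big(\delta^2\sigma_{\max}(s)\big)^m \tag{4.2}$$

and

$$\mathbb{E}\left[|\theta^\top Z\theta|^m\right] \le 2^{m-1}\,\mathbb{E}\left[\big|\theta^\top(YY^\top - \mathrm{diag}(YY^\top))\theta\big|^m\right] + 2^{m-1}\big(\delta^2\sigma_{\max}(s)\big)^m. \tag{4.3}$$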

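For any $\theta \in \mathcal{S}^p$ and $\delta_1, \dots, \delta_p$, we set $\theta_\delta = (\delta_1\theta^{(1)}, \dots, \delta_p\theta^{(p)})^\top$ and we note
the following simple fact that we will use in the next display: $|\theta_\delta|_0 \le |\theta|_0$ with
probability 1. We also set $W = XX^\top + (\delta - 1)\mathrm{diag}(XX^\top) = (W_{j,k})_{1 \le j,k \le p}$. Note
that $W_{j,k} = X^{(j)}X^{(k)}$ for any $j \ne k$ and $W_{j,j} = \delta(X^{(j)})^2$ for $j = 1, \dots, p$.
For any $m \ge 2$ and $\theta \in \mathcal{S}_s^p$, we have

$$\mathbb{E}\left[\big(\theta^\top(YY^\top - \mathrm{diag}(YY^\top))\theta\big)^m\right] = \mathbb{E}\left[\big(\theta_\delta^\top(XX^\top - \mathrm{diag}(XX^\top))\theta_\delta\big)^m\right] \le 2^{m-1}\,\mathbb{E}\left[(\theta_\delta^\top W\theta_\delta)^m\right] + 2^{m-1}\delta^m\,\mathbb{E}\left[\big(\theta_\delta^\top\mathrm{diag}(XX^\top)\theta_\delta\big)^m\right]. \tag{4.4}$$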



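Next, we have for any $m \ge 2$ and $\theta \in \mathcal{S}_s^p$ that

$$(\theta_\delta^\top W\theta_\delta)^m = \sum_{j_1,k_1=1}^p \cdots \sum_{j_m,k_m=1}^p\ \prod_{t=1}^m \theta_\delta^{(j_t)}\theta_\delta^{(k_t)} W_{j_t,k_t}.$$

Taking the expectation, we get for any $m \ge 2$ and $\theta \in \mathcal{S}_s^p$ that

$$\begin{aligned}
\mathbb{E}(\theta_\delta^\top W\theta_\delta)^m &= \mathbb{E}_X\mathbb{E}_\delta\Big[\sum_{j_1,k_1=1}^p \cdots \sum_{j_m,k_m=1}^p\ \prod_{t=1}^m \theta_\delta^{(j_t)}\theta_\delta^{(k_t)} W_{j_t,k_t}\Big] \\
&= \mathbb{E}_X\Big[\delta^m \sum_{j_1,k_1=1}^p \cdots \sum_{j_m,k_m=1}^p\ \prod_{t=1}^m \theta^{(j_t)}\theta^{(k_t)} X^{(j_t)}X^{(k_t)}\Big] \\
&= \delta^m\,\mathbb{E}_X\left[(\theta^\top X)^{2m}\right] \\
&\le 2\delta^m m!\,\big(2\|\theta^\top X\|_{\psi_2}^2\big)^m \\
&\le 2\delta^m m!\,\Big(\frac{2}{c_1}\sigma_{\max}(s)\Big)^m,
\end{aligned} \tag{4.5}$$

where we have used (2.1), (2.2) and Assumption 1 in the last two lines.
Similarly, we have for any $m \ge 2$ and $\theta \in \mathcal{S}_s^p$ that

$$\big(\theta_\delta^\top\mathrm{diag}(XX^\top)\theta_\delta\big)^m = \Big(\sum_{j=1}^p \delta_j(\theta^{(j)})^2(X^{(j)})^2\Big)^m \le \Big(\sum_{j=1}^p (\theta^{(j)})^2(X^{(j)})^2\Big)^m \le \sum_{j=1}^p (\theta^{(j)})^2(X^{(j)})^{2m},$$

where we have used the convexity of $x \to x^m$ and the fact that $\theta \in \mathcal{S}^p$ in the
last line. Taking now the expectation, we get for any $m \ge 2$ and $\theta \in \mathcal{S}_s^p$

$$\begin{aligned}
\mathbb{E}\big(\theta_\delta^\top\mathrm{diag}(XX^\top)\theta_\delta\big)^m &\le \sum_{j=1}^p (\theta^{(j)})^2\,\mathbb{E}\left[(X^{(j)})^{2m}\right] \\
&\le 2m!\,\Big(2\max_{1 \le j \le p}\|X^{(j)}\|_{\psi_2}^2\Big)^m \\
&\le 2m!\,\Big(\frac{2}{c_1}\max_{1 \le j \le p}\Sigma_{j,j}\Big)^m \\
&\le 2m!\,\Big(\frac{2}{c_1}\sigma_{\max}(s)\Big)^m,
\end{aligned} \tag{4.6}$$

where we have used again (2.1) and (2.2) in the second line and Assumption
1 in the third line.

Combining (4.4), (4.5) and (4.6), we get for any $m \ge 2$ and any $\theta \in \mathcal{S}_s^p$ that

$$\mathbb{E}\left[\big(\theta^\top(YY^\top - \mathrm{diag}(YY^\top))\theta\big)^m\right] \le 2m!\,\Big(\frac{4\delta}{c_1}\sigma_{\max}(s)\Big)^m. \tag{4.7}$$

We now plug the above bound in (4.2) to get for any $m \ge 2$ and any $\theta \in \mathcal{S}_s^p$
that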



E 



< 



i! / 8V26 



f'i 



80-maxO) 



Cl 



+ i (25 2 a max ( S ))' 



771! / 165 



2 / o c \ \ rn—2 

ci A 1 



(4.8) 



Similar (and actually faster) computations give for any $m \ge 2$ and any $\theta \in \mathcal{S}_s^p$
that



E 



e T (YY T -diag(yy 1 ))e\ 



< 2 m - 1 E s E x 



X) 



2 m - 1 E (5 E x 



di& g {x x T )9 S y 



< 2 m m!E A - 



[nojxwi 







+ 2 m m! 


^CTmax(s)J 







< 2 m m! (2||0 T X||2 2 ) + m! ( -cr max (s) 



< 2m! — <7 max (s) 



since for any $\delta = (\delta_1, \dots, \delta_p) \in \{0, 1\}^p$, $\|\theta_\delta^\top X\|_{\psi_2} \le \|\theta^\top X\|_{\psi_2}$. Thus, we get for
any $m \ge 2$ and any $\theta \in \mathcal{S}_s^p$ that



6»' 



< 



! / 16 



2 VciAl 



^max(s) 



ci A 1 



m-2 



(4.9) 



We see that the moments conditions in (2.3) are satisfied with K — -^jj<r meix (s) , 
a 1 = ^jcrmax(s) and a = 8a' . Thus, for any fixed 9 G <Sf, Bernstein's inequality 
gives for any tf > that 



> 



16V2 

ci A 1 



Sa- max (s)\ h 



77, Cl Al 71 



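Note now that

$$Z_n^{(2)}(s) = \max_{\theta \in \mathcal{S}_s^p}\left|\theta^\top\big(A_n^{(\delta)} - A^{(\delta)}\big)\theta\right| = \max_{J \subset [p] :\, |J| = s}\ \max_{\theta \in \mathcal{S}^p(J)}\left|\theta^\top\big(A_n^{(\delta)} - A^{(\delta)}\big)\theta\right|.$$

For any fixed $J \subset [p]$ such that $|J| = s$, Lemma 1 guarantees the existence of a
$\frac{1}{4}$-net $\mathcal{N}(J)$ such that $|\mathcal{N}(J)| \le 9^s$ and

$$\max_{\theta \in \mathcal{S}^p(J)}\left|\theta^\top\big(A_n^{(\delta)} - A^{(\delta)}\big)\theta\right| \le 2\max_{\theta \in \mathcal{N}(J)}\left|\theta^\top\big(A_n^{(\delta)} - A^{(\delta)}\big)\theta\right|.$$

Combining the last three displays with a union bound argument, we get for
$t' = t + s\log(9) + s\log\big(\frac{ep}{s}\big)$ and $t > 0$ that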



(4.10) 



with 



i + 5 log(9) + s log (f ) i + s log(9) + s log (f ) 



We proceed similarly to treat the quantity $Z_n^{(1)}(s)$. We first note that



1 n p , „ 

T (diag(S(f) - E W)) * = i ££ ( - «S« (* W 



i=l j = l 

Next, proceeding essentially as in (4.6), we get for any $m \ge 2$ and any $\theta \in \mathcal{S}_s^p$



< E r 5 

< E(^ } ) 2 

i=i 

< E(^ } ) 2 

< f> w ) s 



y(j) 



5 e jj 



-A' 



25m! 



(2||X«| 



2fm!(-E w ] + <5™£™ 



ml ( Wf rz ,\ ^2a max (l) N " 
-V0cr ma x(l) 



— — i 

2 V ci A 1 / V c i Al 

Then, for any $\theta \in \mathcal{S}_s^p$, Bernstein's inequality gives for any $t' > 0$ that

max (l)f 



diag^-S^) 



> 



■- < 2e 



ci A 1 V n ciAl n 
Next, a union bound argument similar to the one used above for $Z_n^{(2)}(s)$ gives

$$\mathbb{P}\left(Z_n^{(1)}(s) > \zeta_n^{(1)}(s, t)\right) \le 2e^{-t}, \tag{4.11}$$
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with 




34,5ft + slog(9) + s log (f )) 2ft + slog(9) + slog (f )) 



n n 



Next, easy computations give $\zeta_n^{(1)}(s,t) + \zeta_n^{(2)}(s,t) \leq \zeta_n(s,t)$ where




t + slog(9) + slog(f) 
S 2 n 



) 



Combining (4.1), (4.11) and (4.10) with a union bound argument, we get, for
any $s = 1,\dots,p$, that

\[ \mathbb{P}\big( Z_n(s) > \zeta_n(s,t) \big) \leq 4e^{-t}. \]



Finally, using again a union bound argument, we get from the previous display 
that 



Replacing t by t + log(ep) and up to a rescaling of the constants, we get that 

' (n > m « (^S^ i±1 ^ M ) }) <- 

for some numerical constant C > 0. 
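4.2. Proof of Theorem 1

We will use the following lemma in order to prove our results.

Lemma 2. Let $\theta_1 \in S^p$. Let $\Sigma \in \mathbb{R}^{p \times p}$ be a symmetric positive semi-definite
matrix with largest eigenvalue $\sigma_1$ of multiplicity 1, associated eigenvector $\theta_1$, and second largest eigenvalue
$\sigma_2$. Then, for any $\theta \in S^p$, we have

\[ \frac{1}{2}(\sigma_1 - \sigma_2) \|\theta\theta^\top - \theta_1\theta_1^\top\|_2^2 \leq \langle \Sigma, \theta_1\theta_1^\top - \theta\theta^\top \rangle. \]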



See Lemma 3.2.1 in [16] for a proof of this result. 
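
As a sanity check, the curvature inequality $\frac{1}{2}(\sigma_1 - \sigma_2)\|\theta\theta^\top - \theta_1\theta_1^\top\|_2^2 \leq \langle \Sigma, \theta_1\theta_1^\top - \theta\theta^\top \rangle$ (with $\|\cdot\|_2$ the Frobenius norm) can be verified numerically. The sketch below is only an illustration under a simplifying assumption: $\Sigma$ is taken diagonal, so that $\theta_1$ is the first canonical basis vector. The function names are hypothetical.

```python
import math
import random

def curvature_gap(sigmas, theta):
    """Right-hand side minus left-hand side of the inequality for Sigma = diag(sigmas).

    Assumes sigmas[0] > sigmas[1] >= sigmas[2] >= ... >= 0 (so theta_1 = e_1)
    and that theta has unit Euclidean norm.
    """
    s1, s2 = sigmas[0], sigmas[1]
    # <Sigma, theta_1 theta_1^T - theta theta^T> = sigma_1 - theta^T Sigma theta
    rhs = s1 - sum(s * t * t for s, t in zip(sigmas, theta))
    # For unit vectors, ||theta theta^T - e_1 e_1^T||_F^2 = 2 - 2 * theta[0]**2
    lhs = 0.5 * (s1 - s2) * (2.0 - 2.0 * theta[0] ** 2)
    return rhs - lhs

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

rng = random.Random(0)
sigmas = [3.0, 1.0, 0.5, 0.0]
gaps = [curvature_gap(sigmas, random_unit_vector(4, rng)) for _ in range(1000)]
```

In this diagonal setting the gap equals $\sum_{j \geq 2} (\sigma_2 - \sigma_j)(\theta^{(j)})^2$, which is nonnegative and vanishes when $\theta$ is supported on the first two coordinates.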



PROOF. We have by definition of $\hat\theta_1$ and in view of Lemma 2 that



< (E - E„, 0i0^ - §JJ ) + (En, 0^7 - 



<( E — E„, 010^ — Q\Q~[\ + 6j Ti n 9i — \\6i\o 

JtJx - A|0i| o ] + A|0i| o -A|0i| o 

< ^E-E n) M7-M7)+A|^i|o-A|ei| 

< ||n JUi7i (s - son^juV^flJ" - ^^|| 2 

+ A|0 1 | o -A|0 1 | O) 



where $\Pi_{\hat J_1 \cup J_1}$ is the orthogonal projection onto $\mathrm{l.s.}(e_j,\ j \in \hat J_1 \cup J_1)$, $\hat J_1 = J(\hat\theta_1)$
and $J_1 = J(\theta_1)$.

Thus we get 



10], - 0101 || 2 < — - — ll n ju./i( S ~ S ™) n juJill°°ll^i^i ~0i0i lb 



2-y/2 



T a flTi 



(71 - (7 2 



A |0i|o-|0i|o 



(71 - (72 

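Set $A = \|\hat\theta_1\hat\theta_1^\top - \theta_1\theta_1^\top\|_2$, $\beta = \frac{2\sqrt{2}}{\sigma_1 - \sigma_2} \|\Pi_{\hat J_1 \cup J_1}(\Sigma - \tilde\Sigma_n)\Pi_{\hat J_1 \cup J_1}\|_\infty$ and $\gamma = \frac{2\lambda}{\sigma_1 - \sigma_2}\left(|\theta_1|_0 - |\hat\theta_1|_0\right)$.
The above display becomes

\[ A^2 - \beta A - \gamma \leq 0. \]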

Next, basic computations on second-order polynomials yield the following necessary
condition on $A$



A< 



where we have used the concavity of $x \mapsto \sqrt{x}$.
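
For illustration, a bound of the same flavour can be obtained from a quadratic inequality $A^2 - \beta A - \gamma \leq 0$ without solving for its roots. The following worked step is a sketch of one elementary route and is not necessarily the computation used in the proof; it relies only on $\beta A \leq (\beta^2 + A^2)/2$, which holds for all real $A$ and $\beta$.

```latex
\[
A^2 \;\leq\; \beta A + \gamma \;\leq\; \tfrac{1}{2}\beta^2 + \tfrac{1}{2}A^2 + \gamma
\quad\Longrightarrow\quad
A^2 \;\leq\; \beta^2 + 2\gamma .
\]
```

Note that this argument places no sign restriction on $\gamma$, which matters here because $\gamma$ is a difference of two sparsity levels.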
Set $s_1 = |\theta_1|_0$ and $\hat s_1 = |\hat\theta_1|_0$. Next, we have



P + 2 7 < 



16 



((7l - (7 2 ) 2 



njllL-^r^Asi 



16 



((7l - (7 2 ) 2 



n 7l ( s n — e ) n 



2A 



(71 - CT 2 



"Si 



< 



--^^maxj^^-^-^lA, 

((7i — (72j^ l<s<s (_ 8 

16 



((71 - (7 2 ) ; 



[Z„(si)] 5 



2 A 



-Si. 



(71 - (7 2 

(4.12) 
Next, we have in view of Proposition 2 and under the condition $\bar s \log(ep/\bar s) \leq
\delta^2 n$, with probability at least $1 - \frac{1}{p}$ that

Z„(s) < c 2 j= , VI < s < s. 

cjM d 2 n 

Thus, we get with probability at least 1 — ± that 

r~ / v. 2 ^ „2 CT max( s l) ( S l + l)log(^/£l) 

qAl (5 2 n 

and 



max f [Z„( S )] 2 - ^^Xs) < 0, 



if we take

\[ \lambda = C \, \frac{\sigma_1^2}{\sigma_1 - \sigma_2} \, \frac{\log(ep)}{\delta^2 n}, \]

where $C > 0$ is a large enough numerical constant.

Combining the last three displays with (4.12), we get with probability at least 

1 - ± that 
v 

P +27<C- Tl s i— ro ' ( 4 - 13 ) 

(cti - o- 2 ) 2 crra 

where $C > 0$ is a numerical constant.



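4.3. Proof of Lemma 1



PROOF. A standard matrix perturbation argument gives $|\hat\sigma_j - \sigma_j| \leq \|\tilde\Sigma_n - \Sigma\|_\infty$,
$\forall 1 \leq j \leq p$. Consequently, we get

\[ \sigma_1 - \|\tilde\Sigma_n - \Sigma\|_\infty \leq \hat\sigma_1 \leq \sigma_1 + \|\tilde\Sigma_n - \Sigma\|_\infty, \]

\begin{align*}
\hat\sigma_1 - \hat\sigma_2 &= \hat\sigma_1 - \sigma_1 + \sigma_1 - \sigma_2 + \sigma_2 - \hat\sigma_2 \\
&\geq \sigma_1 - \sigma_2 - \left( |\hat\sigma_1 - \sigma_1| + |\hat\sigma_2 - \sigma_2| \right) \\
&\geq \sigma_1 - \sigma_2 - 2\|\tilde\Sigma_n - \Sigma\|_\infty
\end{align*}

and similarly

\[ \hat\sigma_1 - \hat\sigma_2 \leq \sigma_1 - \sigma_2 + 2\|\tilde\Sigma_n - \Sigma\|_\infty. \]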

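Combining now (3.4) with (3.9) with a sufficiently large constant $c > 0$, we get
with probability at least $1 - \frac{1}{p}$ that

\[ \frac{1}{2}\sigma_1 \leq \hat\sigma_1 \leq 2\sigma_1, \qquad \frac{1}{2}(\sigma_1 - \sigma_2) \leq \hat\sigma_1 - \hat\sigma_2 \leq 2(\sigma_1 - \sigma_2), \]
and

\[ \frac{\sigma_1^2}{8(\sigma_1 - \sigma_2)} \leq \frac{\hat\sigma_1^2}{\hat\sigma_1 - \hat\sigma_2} \leq \frac{8\sigma_1^2}{\sigma_1 - \sigma_2}. \]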

The conclusion follows immediately. 
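
The perturbation step (eigenvalues move by at most the spectral norm of the perturbation, and the eigengap by at most twice that amount) can be illustrated on a toy example. The sketch below is an illustration only: it uses $2 \times 2$ symmetric matrices so that eigenvalues have a closed form, and the two matrices are made-up numbers standing in for the true covariance matrix and its estimator.

```python
import math

def eig2(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    mean, rad = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    return mean + rad, mean - rad

def spectral_norm2(a, b, d):
    """Spectral norm of a symmetric 2x2 matrix: its largest absolute eigenvalue."""
    hi, lo = eig2(a, b, d)
    return max(abs(hi), abs(lo))

sigma = (3.0, 0.4, 1.0)      # entries (a, b, d) of a toy covariance matrix
estimate = (3.2, 0.3, 0.9)   # a perturbed version, standing in for the estimator
err = spectral_norm2(*(e - s for e, s in zip(estimate, sigma)))
(s1, s2), (h1, h2) = eig2(*sigma), eig2(*estimate)
gap, gap_hat = s1 - s2, h1 - h2

# Each eigenvalue moves by at most err, hence the gap by at most 2 * err.
assert abs(h1 - s1) <= err + 1e-12 and abs(h2 - s2) <= err + 1e-12
assert gap - 2 * err - 1e-12 <= gap_hat <= gap + 2 * err + 1e-12
```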
4.4. Proof of Theorem 3 

This proof uses standard tools of the minimax theory (cf. for instance [14]). 
The proof is more technical in the missing observation case ($\delta < 1$) in order to
get the sharp dependence on the factor $\delta^{-2}$. In order to improve readability, we will
decompose the proof into several technical facts and proceed first with the main 
arguments. Then, we give the proofs for the technical facts. 

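PROOF. We consider the following class $\mathcal{C}$ of $p \times p$ covariance matrices

\[ \mathcal{C} = \left\{ \Sigma_\theta = \Sigma(\theta, \sigma_1, \sigma_2) = \sigma_1\theta\theta^\top + \sigma_2(I_p - \theta\theta^\top),\ \forall \theta \in S^p : |\theta|_0 \leq s_1,\ \forall \sigma_1 \geq (1+\eta)\sigma_2 > 0 \right\}, \qquad (4.14) \]

where $I_p$ is the $p \times p$ identity matrix and $\eta > 0$ is some absolute constant.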

Note that the set C contains only full rank matrices with the same determi- 
nant and whose first principal component 9 is si-sparse. Note also that C C C. 
Indeed, it is easy to see that $\sigma_1$ is the largest eigenvalue of $\Sigma_\theta$ with multiplicity
1 and associated eigenvector $\theta$ with at most $s_1$ nonzero components,
$\|I_p - \theta\theta^\top\|_\infty = 1$ and $(I_p - \theta\theta^\top)\theta = 0$.

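Next, we define $\omega_0 = (1, 1, 0, \dots, 0) \in \{0,1\}^p$ and

\[ \Omega = \left\{ \omega = (\omega^{(1)}, \dots, \omega^{(p)}) \in \{0,1\}^p : \omega^{(1)} = \omega^{(2)} = 1,\ |\omega|_0 = s_1 \right\} \cup \{\omega_0\}. \]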



A Varshamov-Gilbert type bound (see for instance Lemma 4.10 in [8]) guarantees
the existence of a subset $\mathcal{N} \subset \Omega$ with cardinality $\log(\mathrm{Card}(\mathcal{N})) \geq
C_1(s_1 - 2)\log(e(p-2)/(s_1-2))$ containing $\omega_0$ such that, for any two distinct
elements $\omega$ and $\omega'$ of $\mathcal{N}$, we have

1 n ^ Sl 
|o > — 

where $C_1 > 0$ is an absolute constant.
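
To make the packing construction concrete, here is a small Python sketch that enumerates binary vectors with ones in the first two coordinates and a prescribed number of nonzero entries, then greedily extracts a subset that is well separated in Hamming distance. This is an illustration only: the greedy selection is a stand-in for the existence argument of the Varshamov-Gilbert type bound, the separation threshold `min_dist` is an arbitrary illustrative value rather than the constant of the bound, the extra element $\omega_0$ is left out, and all function names are hypothetical.

```python
from itertools import combinations

def candidate_set(p, s1):
    """All 0/1 vectors of length p with ones in the first two slots and s1 ones in total."""
    out = []
    for rest in combinations(range(2, p), s1 - 2):
        w = [1, 1] + [0] * (p - 2)
        for j in rest:
            w[j] = 1
        out.append(tuple(w))
    return out

def hamming(u, v):
    """Number of coordinates where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def greedy_packing(vectors, min_dist):
    """Keep a vector whenever it is at distance >= min_dist from all vectors kept so far."""
    chosen = []
    for w in vectors:
        if all(hamming(w, c) >= min_dist for c in chosen):
            chosen.append(w)
    return chosen

omega = candidate_set(p=10, s1=5)
packing = greedy_packing(omega, min_dist=4)
```

For realistic values of $p$ and $s_1$ the candidate set is far too large to enumerate, which is why the proof relies on an existence result instead of an explicit construction.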



Set e = ay a Sl '^f^^ 51 ^ for some numerical constant a G (0, l/y/2). Note 
that we have e < 1/2 under Condition (3.10). Consider now the following set of 
normalized vectors 



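\[ \Theta = \left\{ \theta(\omega) = \left( \sqrt{\tfrac{1-\epsilon^2}{2}},\ \sqrt{\tfrac{1-\epsilon^2}{2}},\ \tfrac{\epsilon\,\omega^{(3)}}{\sqrt{s_1-2}},\ \dots,\ \tfrac{\epsilon\,\omega^{(p)}}{\sqrt{s_1-2}} \right)^\top : \omega \in \mathcal{N} \setminus \{\omega_0\} \right\} \cup \left\{ \theta_0 = \tfrac{1}{\sqrt{2}}\omega_0 \right\}. \]

Note that $|\Theta| = |\mathcal{N}|$ and $|\theta|_0 \leq s_1$ for any $\theta \in \Theta$.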



Lemma 3. For any $a > 0$ and any distinct $\theta_1, \theta_2 \in \Theta$, we have

iiM7-^ T iii>^ 2 sii °$r /si) - ^ 



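Clearly, for any $\theta \in \Theta$, we have $\Sigma_\theta \in \mathcal{C}$. We introduce now the class

\[ \mathcal{C}(\Theta) = \{ \Sigma_\theta \in \mathcal{C} : \theta \in \Theta \}. \]

Denote by $P_\Sigma$ the distribution of $(Y_1, \dots, Y_n)$. For any $\theta, \theta' \in S^p$, the Kullback-Leibler
divergence $K(P_{\Sigma_{\theta'}}, P_{\Sigma_\theta})$ between $P_{\Sigma_{\theta'}}$ and $P_{\Sigma_\theta}$ is defined by

\[ K(P_{\Sigma_{\theta'}}, P_{\Sigma_\theta}) = \mathbb{E}_{\Sigma_{\theta'}} \log\left( \frac{dP_{\Sigma_{\theta'}}}{dP_{\Sigma_\theta}} \right). \]

We have the following result.

Lemma 4. Let $X_1, \dots, X_n \in \mathbb{R}^p$ be i.i.d. $N(0, \Sigma)$ with $\Sigma = \Sigma_\theta \in \mathcal{C}(\Theta)$. Assume
that $\frac{\sigma_1}{\sigma_2} \geq 1 + \eta$ for some absolute $\eta > 0$. Taking $a > 0$ sufficiently small, we
have for any $\theta' \in S^p$, that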

^,PfcJ<Y-iipg(f). 

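Thus, we have that

\[ \frac{1}{\mathrm{Card}(\Theta) - 1} \sum_{\theta \in \Theta \setminus \{\theta_0\}} K(P_{\Sigma_\theta}, P_{\Sigma_{\theta_0}}) \leq \alpha \log(\mathrm{Card}(\Theta) - 1) \qquad (4.17) \]

is satisfied for any $\alpha > 0$ if $a > 0$ is chosen as a sufficiently small numerical
constant depending on $\alpha$. In view of (4.16) and (4.17), (3.11) now follows by
application of Theorem 2.5 in [14].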

4.5. Proof of Lemma 3

PROOF. For any distinct $\theta_1, \theta_2 \in \Theta$, we have



2 



1 2 a 2 _ 2 s 1 \og(ep/s 1 ) 



1 12 - 8 8 S 2 n 

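Next, we need to compare $\|\theta_1\theta_1^\top - \theta_2\theta_2^\top\|_2^2$ to $|\theta_1 - \theta_2|_2^2$.
For any $\theta_1, \theta_2 \in \Theta$, we have

\begin{align*}
\|\theta_1\theta_1^\top - \theta_2\theta_2^\top\|_2^2 &= 2 - 2(\theta_1^\top\theta_2)^2 \\
&= |\theta_1|_2^2 + |\theta_2|_2^2 - 2(\theta_1^\top\theta_2)^2 \\
&= |\theta_1 - \theta_2|_2^2 + 2\left[ (\theta_1^\top\theta_2) - (\theta_1^\top\theta_2)^2 \right].
\end{align*}

We immediately get from the previous display that $\|\theta_1\theta_1^\top - \theta_2\theta_2^\top\|_2^2 \geq |\theta_1 - \theta_2|_2^2$
for any $\theta_1, \theta_2 \in \Theta$.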
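
The identity between the squared Frobenius distance of the rank-one projectors and the squared Euclidean distance of the vectors is easy to check numerically. The sketch below is an illustration with two made-up unit vectors with nonnegative entries (so that their inner product lies in $[0, 1]$, as for the elements of the packing set); the function names are hypothetical.

```python
import math

def frob_sq_of_projector_diff(t1, t2):
    """Squared Frobenius norm of t1 t1^T - t2 t2^T, computed entry by entry."""
    n = len(t1)
    return sum((t1[j] * t1[k] - t2[j] * t2[k]) ** 2 for j in range(n) for k in range(n))

def right_hand_side(t1, t2):
    """Returns |t1 - t2|^2 + 2[(t1.t2) - (t1.t2)^2] together with |t1 - t2|^2."""
    inner = sum(a * b for a, b in zip(t1, t2))
    dist_sq = sum((a - b) ** 2 for a, b in zip(t1, t2))
    return dist_sq + 2.0 * (inner - inner ** 2), dist_sq

t1 = (math.sqrt(0.5), math.sqrt(0.5), 0.0, 0.0)
t2 = (math.sqrt(0.4), math.sqrt(0.4), math.sqrt(0.2), 0.0)
lhs = frob_sq_of_projector_diff(t1, t2)
rhs, dist_sq = right_hand_side(t1, t2)
```

Since the inner product of two such vectors lies in $[0, 1]$, the bracketed term is nonnegative, which is what gives the lower bound by $|\theta_1 - \theta_2|_2^2$.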

4.6. Proof of Lemma 4

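Recall that $X_1, \dots, X_n \in \mathbb{R}^p$ are i.i.d. $N(0, \Sigma)$ with $\Sigma = \Sigma_\theta \in \mathcal{C}(\Theta)$. For any
$1 \leq i \leq n$, set $\delta_i = (\delta_{i,1}, \dots, \delta_{i,p})^\top \in \mathbb{R}^p$. We note that $\delta_1, \dots, \delta_n$ are random
vectors in $\mathbb{R}^p$ with i.i.d. entries $\delta_{i,j} \sim B(\delta)$ and independent from $(X_1, \dots, X_n)$.
Recall that the observations $Y_1, \dots, Y_n$ satisfy $Y_i^{(j)} = \delta_{i,j}X_i^{(j)}$. Denote by
$P_{\Sigma_\theta}$ the distribution of $(Y_1, \dots, Y_n)$ and by $P^{(\delta)}_{\Sigma_\theta}$ the conditional distribution
of $(Y_1, \dots, Y_n)$ given $(\delta_1, \dots, \delta_n)$. Next, we note that for any $1 \leq i \leq n$ the
conditional random variables $Y_i \mid (\delta_1, \dots, \delta_n)$ are independent Gaussian vectors
$N(0, \Sigma_\theta^{(\delta_i)})$, where $(\Sigma_\theta^{(\delta_i)})_{j,k} = \delta_{i,j}\delta_{i,k}(\Sigma_\theta)_{j,k}$ for $1 \leq j, k \leq p$.
Thus, we have $P^{(\delta)}_{\Sigma_\theta} = \otimes_{i=1}^n P_{\Sigma_\theta^{(\delta_i)}}$. Denote respectively by $P_\delta$ and $\mathbb{E}_\delta$ the
probability distribution of $(\delta_1, \dots, \delta_n)$ and the associated expectation. We also
denote by $\mathbb{E}_{\Sigma_\theta}$ and $\mathbb{E}^{(\delta)}_{\Sigma_\theta}$ the expectation and conditional expectation associated
respectively with $P_{\Sigma_\theta}$ and $P^{(\delta)}_{\Sigma_\theta}$.

Next, for any $\theta, \theta' \in S^p$, the Kullback-Leibler divergence $K(P_{\Sigma_{\theta'}}, P_{\Sigma_\theta})$ between
$P_{\Sigma_{\theta'}}$ and $P_{\Sigma_\theta}$ satisfies



(4.18)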



Set $\theta_{\delta_i} = (\delta_{i,1}\theta^{(1)}, \dots, \delta_{i,p}\theta^{(p)})^\top$. In view of (4.14), we have




(4.19) 
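and $\Pi_{\theta,\delta_i}$ is the orthogonal projection onto $\mathrm{l.s.}(\theta_{\delta_i})$ (note indeed that we have
in general $|\theta_{\delta_i}|_2 \leq 1$, therefore $\Pi_{\theta,\delta_i} = |\theta_{\delta_i}|_2^{-2}\theta_{\delta_i}\theta_{\delta_i}^\top$). For any $\theta \in \Theta$, we set

\[ \sigma_1(\theta) = (\sigma_1 - \sigma_2)|\theta_{\delta_i}|_2^2 + \sigma_2. \]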



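Fact 1: For any $1 \leq i \leq n$, any $\theta, \theta' \in S^p$ and any realization of $\delta_i \in \{0,1\}^p$,
we have

\begin{align*}
K\big(P_{\Sigma_{\theta'}^{(\delta_i)}}, P_{\Sigma_\theta^{(\delta_i)}}\big) ={}& \frac{1}{2}\left( \frac{\sigma_2}{\sigma_1(\theta)} + \frac{\sigma_1(\theta')}{\sigma_2} - 2 \right) + \frac{1}{2}\log\left( \frac{\sigma_1(\theta)}{\sigma_1(\theta')} \right) \\
&+ \frac{1}{2}\mathrm{tr}\left( \Pi_{\theta,\delta_i}\Pi_{\theta',\delta_i} \right)\left( \frac{\sigma_1(\theta')}{\sigma_1(\theta)} - \frac{\sigma_2}{\sigma_1(\theta)} + 1 - \frac{\sigma_1(\theta')}{\sigma_2} \right).
\end{align*}

We apply Fact 1 with $\theta = \theta_0 = \frac{1}{\sqrt{2}}\omega_0$ and take the expectation w.r.t. $\delta_i$ for
any $i = 1, \dots, n$. Thus, we get the following.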



Fact 2: Assume that $\frac{\sigma_1}{\sigma_2} \geq 1 + \eta$ for some absolute $\eta > 0$. Taking $a > 0$
sufficiently small (that can depend only on $\eta$), we have for any $i = 1, \dots, n$,
any $\theta' \in S^p$, that



a p (^p 



^ 2 
< ■ 

~ 2a 2 



We immediately get from Fact 2 for any $\theta' \in \Theta$ that

(5 2 n 



A' (I 



»=i 



A(P s( , i)) P v , ( 



9(1 



- ySilog(ep/si). 



4.7. Proof of Fact 1 



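PROOF. In view of (4.19), we get for any $1 \leq i \leq n$, any $\theta, \theta' \in S^p$ and any
realization $\delta_i \in \{0,1\}^p$ that $P_{\Sigma_{\theta'}^{(\delta_i)}} \ll P_{\Sigma_\theta^{(\delta_i)}}$ and hence $K\big(P_{\Sigma_{\theta'}^{(\delta_i)}}, P_{\Sigma_\theta^{(\delta_i)}}\big) < \infty$.

Define $J_i = \{j : \delta_{i,j} = 1,\ 1 \leq j \leq p\}$ and $d_i = |J_i|$. Define the mapping $P_i :
\mathbb{R}^p \to \mathbb{R}^{d_i}$ as follows: $P_i(x) = x(J_i)$ where for any $x = (x^{(1)}, \dots, x^{(p)})^\top \in \mathbb{R}^p$,
$x(J_i) \in \mathbb{R}^{d_i}$ is obtained by keeping only the components $x^{(k)}$ with their index
$k \in J_i$. We denote by $P_i^* : \mathbb{R}^{d_i} \to \mathbb{R}^p$ the right inverse application of $P_i$. We
note that

\[ P_i\Sigma_\theta^{(\delta_i)}P_i^* = \sigma_1(\theta)\Pi_{\theta(J_i),\delta_i} + \sigma_2\left[ I_{d_i} - \Pi_{\theta(J_i),\delta_i} \right], \]

where $\Pi_{\theta(J_i),\delta_i}$ denotes the orthogonal projection onto the subspace $\mathrm{l.s.}(\theta(J_i))$
of $\mathbb{R}^{d_i}$. Note also that $P_i\Sigma_\theta^{(\delta_i)}P_i^*$ admits an inverse for any $\theta \in S^p$ provided that
$\delta_i$ is not the null vector in $\mathbb{R}^p$ and we have

\[ \left( P_i\Sigma_\theta^{(\delta_i)}P_i^* \right)^{-1} = \frac{1}{\sigma_1(\theta)}\Pi_{\theta(J_i),\delta_i} + \frac{1}{\sigma_2}\left[ I_{d_i} - \Pi_{\theta(J_i),\delta_i} \right]. \]

Thus, we get for any $\theta, \theta' \in S^p$ that

\begin{align*}
K\big(P_{\Sigma_{\theta'}^{(\delta_i)}}, P_{\Sigma_\theta^{(\delta_i)}}\big) &= K\big(P_{P_i\Sigma_{\theta'}^{(\delta_i)}P_i^*}, P_{P_i\Sigma_\theta^{(\delta_i)}P_i^*}\big) \\
&= \frac{1}{2}\mathrm{tr}\left( \big(P_i\Sigma_\theta^{(\delta_i)}P_i^*\big)^{-1}P_i\Sigma_{\theta'}^{(\delta_i)}P_i^* \right) - \frac{d_i}{2} - \frac{1}{2}\log\left( \frac{\det\big(P_i\Sigma_{\theta'}^{(\delta_i)}P_i^*\big)}{\det\big(P_i\Sigma_\theta^{(\delta_i)}P_i^*\big)} \right) \\
&= \frac{1}{2}\mathrm{tr}\left( \left[ \frac{1}{\sigma_1(\theta)}\Pi_{\theta(J_i),\delta_i} + \frac{1}{\sigma_2}\big(I_{d_i} - \Pi_{\theta(J_i),\delta_i}\big) \right]\left[ \sigma_1(\theta')\Pi_{\theta'(J_i),\delta_i} + \sigma_2\big(I_{d_i} - \Pi_{\theta'(J_i),\delta_i}\big) \right] \right) \\
&\qquad - \frac{d_i}{2} - \frac{1}{2}\log\left( \frac{\det\big(P_i\Sigma_{\theta'}^{(\delta_i)}P_i^*\big)}{\det\big(P_i\Sigma_\theta^{(\delta_i)}P_i^*\big)} \right) \\
&= \frac{1}{2}\left( \frac{\sigma_2}{\sigma_1(\theta)} + \frac{\sigma_1(\theta')}{\sigma_2} - 2 \right) + \frac{1}{2}\mathrm{tr}\left( \Pi_{\theta,\delta_i}\Pi_{\theta',\delta_i} \right)\left( \frac{\sigma_1(\theta')}{\sigma_1(\theta)} - \frac{\sigma_2}{\sigma_1(\theta)} + 1 - \frac{\sigma_1(\theta')}{\sigma_2} \right) + \frac{1}{2}\log\left( \frac{\sigma_1(\theta)}{\sigma_1(\theta')} \right),
\end{align*}

where we have used that $\sigma_1(\theta)$ and $\sigma_2$ are eigenvalues of $P_i\Sigma_\theta^{(\delta_i)}P_i^*$ with respective
multiplicity 1 and $d_i - 1$ for any $\theta \in \Theta$, and also that $\mathrm{tr}\left( \Pi_{\theta(J_i),\delta_i}\Pi_{\theta'(J_i),\delta_i} \right) =
\mathrm{tr}\left( \Pi_{\theta,\delta_i}\Pi_{\theta',\delta_i} \right)$ for any $\theta, \theta' \in \Theta$.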
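
The closed-form expression for the divergence between the two masked Gaussian distributions can be cross-checked against the generic formula $\frac{1}{2}\left[\mathrm{tr}(A^{-1}B) - d - \log(\det B/\det A)\right]$ for the divergence of $N(0, B)$ from $N(0, A)$. The sketch below does this in dimension $d = 2$, assuming two observed coordinates; the vectors `u` and `u_prime` play the role of the observed parts of $\theta$ and $\theta'$ (their norms may be smaller than 1), the numerical values are made up, and all names are hypothetical.

```python
import math

def masked_cov(u, s1, s2):
    """sigma2 * I + (sigma1 - sigma2) * u u^T on the two observed coordinates."""
    a, b = u
    g = s1 - s2
    return [[s2 + g * a * a, g * a * b], [g * a * b, s2 + g * b * b]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def trace_prod(a, b):
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2))

def kl_generic(b, a):
    """KL divergence of N(0, b) from N(0, a) in dimension 2."""
    return 0.5 * (trace_prod(inv2(a), b) - 2 - math.log(det2(b) / det2(a)))

def kl_closed_form(u, u_prime, s1, s2):
    """Closed form in terms of sigma_1(theta), sigma_1(theta') and tr(Pi Pi')."""
    n_u, n_up = sum(x * x for x in u), sum(x * x for x in u_prime)
    sig, sig_p = (s1 - s2) * n_u + s2, (s1 - s2) * n_up + s2
    inner = sum(x * y for x, y in zip(u, u_prime))
    tr_pp = inner ** 2 / (n_u * n_up)  # trace of the product of the two projectors
    coef = sig_p / sig - s2 / sig + 1 - sig_p / s2
    return 0.5 * (s2 / sig + sig_p / s2 - 2) + 0.5 * math.log(sig / sig_p) + 0.5 * tr_pp * coef

u, u_prime, s1, s2 = (0.6, 0.0), (0.5, 0.5), 3.0, 1.0
generic = kl_generic(masked_cov(u_prime, s1, s2), masked_cov(u, s1, s2))
closed = kl_closed_form(u, u_prime, s1, s2)
```

Both routes agree up to rounding, which is a useful guard against sign or ordering mistakes when manipulating the three terms separately.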

4.8. Proof of Fact 2 



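PROOF.

For any $i = 1, \dots, n$, we have that

\[ \sigma_1(\theta_0) = (\sigma_1 - \sigma_2)\frac{\delta_{i,1} + \delta_{i,2}}{2} + \sigma_2 \]
and

\[ \sigma_1(\theta') = (\sigma_1 - \sigma_2)\left[ (1 - \epsilon^2)\frac{\delta_{i,1} + \delta_{i,2}}{2} + \epsilon^2\sum_{j=3}^p \frac{\delta_{i,j}\,\omega^{(j)}}{s_1 - 2} \right] + \sigma_2. \]

Note that the random quantities in the above displays depend on $\delta_{i,1}, \delta_{i,2}$
only through the sum $Z_i := \delta_{i,1} + \delta_{i,2} \sim \mathrm{Bin}(2, \delta)$ and that $Z_i$ is independent of
$(\delta_{i,3}, \dots, \delta_{i,p})$.
Thus, we get

\[ \mathbb{E}_\delta\left[ \frac{\sigma_2}{2\sigma_1(\theta_0)} \right] = \mathbb{E}_\delta\left[ \frac{\sigma_2}{(\sigma_1 - \sigma_2)(\delta_{i,1} + \delta_{i,2}) + 2\sigma_2} \right] = \frac{\delta^2\sigma_2}{2\sigma_1} + \frac{2\delta(1-\delta)\sigma_2}{\sigma_1 + \sigma_2} + \frac{(1-\delta)^2}{2}. \]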
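
The expectation with respect to $Z_i \sim \mathrm{Bin}(2, \delta)$ is a three-point average with weights $(1-\delta)^2$, $2\delta(1-\delta)$ and $\delta^2$, using $\sigma_1(\theta_0) = (\sigma_1 - \sigma_2)Z_i/2 + \sigma_2$. The sketch below checks the resulting expression with exact rational arithmetic. It also checks, as an assumption-labelled side computation, that adding $\mathbb{E}[\sigma_1(\theta')]/(2\sigma_2) - 1$ with $\mathbb{E}[\sigma_1(\theta')] = \delta\sigma_1 + (1-\delta)\sigma_2$ (valid for a unit vector $\theta'$) gives $\delta(\sigma_1-\sigma_2)^2(\sigma_1+\delta\sigma_2)/(2\sigma_1\sigma_2(\sigma_1+\sigma_2))$. The numerical values and function names are hypothetical.

```python
from fractions import Fraction

def expect_over_z(f, delta):
    """E f(Z) for Z ~ Bin(2, delta): weights (1-d)^2, 2d(1-d), d^2 on Z = 0, 1, 2."""
    weights = [(1 - delta) ** 2, 2 * delta * (1 - delta), delta ** 2]
    return sum(w * f(z) for z, w in enumerate(weights))

def sigma1_theta0(z, s1, s2):
    """sigma_1(theta_0) when Z of the first two coordinates are observed."""
    return (s1 - s2) * Fraction(z, 2) + s2

delta, s1, s2 = Fraction(1, 3), Fraction(5), Fraction(2)

first = expect_over_z(lambda z: s2 / (2 * sigma1_theta0(z, s1, s2)), delta)
first_closed = (delta ** 2 * s2 / (2 * s1)
                + 2 * delta * (1 - delta) * s2 / (s1 + s2)
                + (1 - delta) ** 2 / 2)

# Side computation (assumes |theta'|_2 = 1): combine with E[sigma_1(theta')] / (2 sigma_2) - 1.
total = first + (delta * s1 + (1 - delta) * s2) / (2 * s2) - 1
total_closed = delta * (s1 - s2) ** 2 * (s1 + delta * s2) / (2 * s1 * s2 * (s1 + s2))
```

Using `Fraction` avoids any floating-point tolerance, so the two identities are verified exactly for the chosen values.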
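Similarly, we obtain

\[ \mathbb{E}_\delta\left[ \frac{\sigma_1(\theta')}{2\sigma_2} \right] = \frac{\delta(\sigma_1 - \sigma_2)|\theta'|_2^2 + \sigma_2}{2\sigma_2} = \frac{\delta\sigma_1}{2\sigma_2} + \frac{1-\delta}{2}. \]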



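Combining the last two displays, we get

\[ \mathbb{E}_\delta\left[ \frac{\sigma_2}{2\sigma_1(\theta_0)} + \frac{\sigma_1(\theta')}{2\sigma_2} - 1 \right] = \frac{\delta(\sigma_1 - \sigma_2)^2(\sigma_1 + \delta\sigma_2)}{2\sigma_1\sigma_2(\sigma_1 + \sigma_2)}. \]

We study now the following quantity

\[ \frac{1}{2}\mathrm{tr}\left( \Pi_{\theta_0,\delta_i}\Pi_{\theta',\delta_i} \right)\left( \frac{\sigma_1(\theta')}{\sigma_1(\theta_0)} - \frac{\sigma_2}{\sigma_1(\theta_0)} + 1 - \frac{\sigma_1(\theta')}{\sigma_2} \right). \]

We note first that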

1100,(5, 



and 



6%,i + Si- 



1-e 1 



Si,i 


Si,lSi,2 







Si,2 





O 


o 






9' Si 



s iA 






Si,lS it 2 


Si.2 




* 


* 


* 



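Thus, we get that

\[ \mathrm{tr}\left( \Pi_{\theta_0,\delta_i}\Pi_{\theta',\delta_i} \right) = \frac{(1 - \epsilon^2)(\delta_{i,1} + \delta_{i,2})}{2|\theta'_{\delta_i}|_2^2}. \]

Next, we set

\[ \sigma(\theta_0, \theta') = \frac{\sigma_2\sigma_1(\theta') + \sigma_2\sigma_1(\theta_0) - \sigma_2^2 - \sigma_1(\theta_0)\sigma_1(\theta')}{\sigma_2\sigma_1(\theta_0)}. \]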

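If $Z_i = 1$, then $\sigma_1(\theta_0) = (\sigma_1 + \sigma_2)/2$ and

\[ \sigma(\theta_0, \theta') = \frac{2\sigma_2\sigma_1(\theta') + \sigma_2(\sigma_1 + \sigma_2) - 2\sigma_2^2 - (\sigma_1 + \sigma_2)\sigma_1(\theta')}{\sigma_2(\sigma_1 + \sigma_2)} = -\frac{(\sigma_1 - \sigma_2)^2|\theta'_{\delta_i}|_2^2}{\sigma_2(\sigma_1 + \sigma_2)}. \]

If $Z_i = 2$, then $\sigma_1(\theta_0) = \sigma_1$ and

\[ \sigma(\theta_0, \theta') = -\frac{(\sigma_1 - \sigma_2)^2|\theta'_{\delta_i}|_2^2}{\sigma_1\sigma_2}. \qquad (4.20) \]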
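We now freeze $(\delta_{i,3}, \dots, \delta_{i,p})$ and compute the following expectation w.r.t. $Z_i$

\[ \mathbb{E}_{Z_i}\left[ \frac{1}{2}\mathrm{tr}\left( \Pi_{\theta_0,\delta_i}\Pi_{\theta',\delta_i} \right)\sigma(\theta_0, \theta') \right] = -\delta^2(1 - \epsilon^2)\frac{(\sigma_1 - \sigma_2)^2}{2\sigma_1\sigma_2} - \delta(1-\delta)(1 - \epsilon^2)\frac{(\sigma_1 - \sigma_2)^2}{2\sigma_2(\sigma_1 + \sigma_2)}. \]

We note that the above display does not depend on $(\delta_{i,3}, \dots, \delta_{i,p})$. Thus, we get

\[ \mathbb{E}_\delta\left[ \frac{1}{2}\mathrm{tr}\left( \Pi_{\theta_0,\delta_i}\Pi_{\theta',\delta_i} \right)\sigma(\theta_0, \theta') \right] = -\delta^2(1 - \epsilon^2)\frac{(\sigma_1 - \sigma_2)^2}{2\sigma_1\sigma_2} - \delta(1-\delta)(1 - \epsilon^2)\frac{(\sigma_1 - \sigma_2)^2}{2\sigma_2(\sigma_1 + \sigma_2)}. \]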



Combining the above display with (4.20), we get 



2(7lf7 2 ((7l + (7 2 ) 
(3 ((71 - (7 2 ) 2 



2(7l(7 2 

5(1 _f) (l- £ ^) < gl -^ a . 

V ^ 7 2(7 2 ((7l +(7 2 ) 



2 (7l(7 2 ((7l + (7 2 ) 

We study now the logarithm factor 

A 2 := Je 4 log 



((l-(5)(l-2e 2 )(7i -<5e 2 ((7i +ct 2 )) 



<ri(flo) 

Ol(0') 



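Recall that $\sigma_1(\theta_0) = (\sigma_1 - \sigma_2)\frac{Z_i}{2} + \sigma_2$ with $Z_i = \delta_{i,1} + \delta_{i,2} \sim \mathrm{Bin}(2, \delta)$ and

\[ \sigma_1(\theta') = (\sigma_1 - \sigma_2)|\theta'_{\delta_i}|_2^2 + \sigma_2 = (\sigma_1 - \sigma_2)\left[ (1 - \epsilon^2)\frac{Z_i}{2} + \frac{\epsilon^2}{s_1 - 2}Z_i' \right] + \sigma_2, \]

with $Z_i' = \sum_{j=3}^p \delta_{i,j}\,\omega^{(j)} \sim \mathrm{Bin}(s_1 - 2, \delta)$, which is independent of $(\delta_{i,1}, \delta_{i,2})$.
We now freeze $Z_i'$ and take the expectation w.r.t. $Z_i$. Thus, we get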



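\begin{align*}
\frac{1}{2}\mathbb{E}_{Z_i}\log\left( \frac{\sigma_1(\theta_0)}{\sigma_1(\theta')} \right) ={}& -\frac{(1-\delta)^2}{2}\log\left( \frac{(\sigma_1 - \sigma_2)\frac{\epsilon^2}{s_1-2}Z_i' + \sigma_2}{\sigma_2} \right) \\
& - \delta(1-\delta)\log\left( \frac{(\sigma_1 - \sigma_2)\left[ (1-\epsilon^2) + \frac{2\epsilon^2}{s_1-2}Z_i' \right] + 2\sigma_2}{\sigma_1 + \sigma_2} \right) \\
& - \frac{\delta^2}{2}\log\left( \frac{(\sigma_1 - \sigma_2)\left[ (1-\epsilon^2) + \frac{\epsilon^2}{s_1-2}Z_i' \right] + \sigma_2}{\sigma_1} \right).
\end{align*}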
We study now the first term in the right-hand side of the above display. We
have



(o"i -0-2)^^2^ + 0-2 



(To 



l-6 2 +6 2 



1 2,2 

1 — e + e 



Oo 



si - 2 



E 

J=3 



Oj. 
0-2 



1 fcj + 1 



,0) 



si - 2 



since $\sum_{j=3}^p \omega^{(j)} = s_1 - 2$ by construction. Next, we notice that $-\log$ is convex.
Thus, applying Jensen's inequality twice gives that

(0-1 - 0-2)^=2^ + o- 2 



E 



z, 



log 



fX 2 



< e^E 



2, 



log ( E 

i=3 



Oj, 

a 2 



1 Oj,j + 1 



,(./) 



si -2 



^- 2 E 



i=3 



log 



(To 



1 I fcj + 1 



<_^ 2 lo K (^ 

0-2 



We proceed similarly and obtain 



( ( T 1 -(T2)[(l-e 2 ) + - 2 ^ 2 -Z I ]+2 t T2 



Ol + 02 



2ct 2 2(o-i - o 2 )Z, 



^cti+o-2 (si - 2)(a 1 + a 2 ) J 

(1 _ 2 ) , 2 ^ (J) f 2o 2 + ^.j(oi -02) 
1 ' ^ si - 2 V 01 + o 2 

3=3 



and 



— lOE 



(01 -o 2 )[(l-e 2 ) + -^Zi] + 2a 2 



o~i + <y 2 

P (i) 
<-(l-<5)e 2 log 



2o 2 + ^i,j(o-l - (T 2 ) 



Ol + o 2 



2oo 



Ol + CT 2 



We obtain similarly 

(01 -o 2 )[(l-e 2 ) + -^Zi]+a 2 



(l-e 2 ) + e 2 



01 



Z i (a 1 — o 2 ) 02 
(si - 2)oi 01 



= (l- e 2 ) + e 2 E 



3=3 



si-2 



o 2 + S i j(a 1 - a 2 y 



01 
and



log 



i=3 



Sl -2 



E^j log 



02 + - CT 2 ) 



0"! 



Thus, we get 
2 

(1-5VJ 



e log 



log 



(l-J)lo; 
(o-i +CT2) 2 



< -(1 -<5)e 2 log 



(5(1 -5ft 2 lo, 
-21o 



2a 2 



2a 2 



(1-S)6\ 2 



v a 1 + a 2 
log(2) + 2tUog 



5 log 



e log 



Set $\Lambda := \Lambda_1 + \Lambda_2$. We have

a S (O-I ~ ^2) 2 



2 0-1(72 (fl + (72) 



< ii e 2 _ go^j) 

- 2o- 2 2 



((l-«5)(l-2e 2 )(7 1 -<5e 2 (a 1 +(7 2 )) 



(1-6)6 



log 



((71 + £7 2 ) 



4(71(72 



2<51og( ^ 



< — e z 

~ 2(7 2 



5(1 - 5) 



((7l ~ (7 2 ) 2 
02 (f 1 +02) 
(0-1 - 0"2 ) 2 



((1 - 2e 2 )) - 



lo 



r / (01+02)' 

' V 4(71(72 



21og(^ 

02 



T ((l-2 e 2 ))- e 2 log(" (Jl((Tl+ 3 a2)2 

0-2(^1+ 02) U " & \ 4£7f 



We now show that if the absolute constant a > (recall that e = aa 
is taken sufficiently small, then we have 



((71 - g 2 ) 2 
0-2(0-1 +0-2) 



((l-2 e 2 ))- e 2 k,g 



01(0-1 + o- 2 ) 5 
4of 



>0, V0-1 > (1 +77)02. 
We set $u = \sigma_1 - \sigma_2$ and $x = u/(2\sigma_2)$. Then, we have



(ax - cr 2 ) 2 
0-2(0-1 + o- 2 ) 



((l-2 e 2 ))- e 2 log 



0-1(0-1 + a-if 



4<xf 



0-2 (2o 2 + u 
,,2 



(l-2e 2 )-e 2 log 



(0-2 +m)(2o 2 +uf 



2a 2 (l+u/(2o 2 )) 



(l-2e 2 )-e 2 log 



> 



2o- 2 (l + w/(2o- 2 ) 
^2 



(1 - 2e 2 ) - 3e 2 log 1 



4o- 2 3 

u 

L + — 

0-2 

u 
o 2 



2o- 2 



> 2(1 - 2e 2 ) 



1 + x 



3e 2 log(l + 2x) > 0, Va;> 



provided that the numerical constant $a > 0$ is taken sufficiently small (and this
choice can depend only on $\eta > 0$).